Improvements to RIVET's Input Handling

rivetTDA / rivet

RIVET is a tool for Topological Data Analysis, in particular two-parameter persistent homology.

GNU General Public License v3.0

76 stars 24 forks source link

Improvements to RIVET's Input Handling #149

Closed hackerde closed 3 years ago

hackerde commented 4 years ago

The distance matrix class is something similar to bifiltration data. It groups the functionalities related to a distance matrix and also introduces some extra features for computing function values from given data points.

Features:

data_reader.cpp now is more condensed. Reading a distance matrix or computing it from a point cloud is moved to the DistanceMatrix class.
Introduction of the --function flag. This allows the user to compute function values from the given data. Options are ball density, gaussian density or eccentricity. User can specify a parameter for the computation, or a default will be used.
GUI also has the feature to supply these functions
The full distance matrix is read in compared to before (when only values less than max_dist were read in).

Other features:

Modified script convert.py to convert old files to new input format.
Added output tests for rivet_console.

This branch is merged with input.

mlesnick commented 4 years ago

I have started looking at the changes implemented here, focusing on using the new file input dialogue box. This is a very nice improvement, though there are some things that need to be fixed.

Here are some initial comments, after only very preliminary testing:

-The box to set the distance value to inf is cut off on my screen (on a MacBook)

-It may be a good idea to label the numerical parameter in the file dialogue window. I think that currently this parameter is only used to define a density estimate. Still, since future uses could be different, it could be labeled something generic like: function param.

-I see some incorrect behavior when using the file senate_RIVET_Codensity20thPercentile.txt in the Data/Roadmap_Benchmark_Data folder. (I did not look to see whether the file was converted correctly, so I don't know if this is an issue with the file or with RIVET itself).

When I compute the degree-Rips bifiltration, with 40x40 bins, I get this picture:

For this same input, I get a different incorrect output (something that is far from correct, not shown here) when I set the max scale parameter to inf, even though in this file, the max scale parameter is already equal to the largest distance.

For comparison, the correct picture, obtained by instead using the file "senate_RIVET_no_function.txt", is this:

I'm not 100% sure what the desired behavior is for old RIVET file formats, but my testing indicates that there are some compatibility issues with the old file type. For example: -When choosing a user-defined function, the numerical parameter is greyed out. -You are allowed to choose to select to use a user-defined function, even when none is provided in the file, which then leads to a crash. -RIVET is crashing with a user-defined function provided in the old format, even when the correct options are selected.

mlesnick commented 4 years ago

At Matthew's suggestion, I have started testing the new version of the this pull request. My tests so far focus on the file input GUI, not command-line interactivity, and on the examples that come included with RIVET. This is not systematic testing; I'm just trying a few things out.

Here are my findings so far:

It looks like all the bugs I mentioned earlier have been addressed.

The built-in coeccentricity function looks to not be working correctly, based on experiments with the file network379_RIVET_Coeccentricity.txt in the "roadmap" folder. The same incorrect behavior is appearing regardless of whether I use this file in the new format created by Anway's script or the old format.

I did also notice another curious bug, which in fact exists already in the main branch, and is not Anway's doing. This is now posted as issue 156.

Matthew mentioned privately that he sees a crash with circle_300pts_density.csv. I also see this

One other suggestion (not a bug per se):

A density function or eccentricity function should usually be used with the descending filtration order by default, but the user may not know this. So when RIVET's user-defined density function is selected, the "reverse box" should be checked by default.

mlwright84 commented 4 years ago

Thanks, Anway, for your work. RIVET no longer crashes when loading circle_300pts_density.csv.

I have noticed another problem when testing CSV files with and without function values. I just added a file circle_300pts_nofunction.csv, which contains the same points as circle_300pts_density.csv, but without the density values. RIVET should produce the same output for both of these files when the user chooses to have RIVET compute function values for the function-RIPS filtration, but I am seeing different output. I have tried H_0 and H_1 homology, with balldensity and knndensity functions.

For example, run RIVET with the following data and parameters:

file: circle_300pts_density.csv type: points_fn homology: 1 max dist: 8 filtration: function function: balldensity parameter: 2 x-bins: 10 y-bins: 10 x-reverse

The output looks reasonable.

Then compare with:

file: circle_300pts_nofunction.csv type: points all other parameters same as above

RIVET displays an empty diagram.

hackerde commented 4 years ago

According to the data structures, when the tokens vector is filled up in data_reader.cpp, it expects either a function value associated with the point, or a 0 as an artificial birth value. This value is required, even if it is later ignored when new function values are computed in the DistanceMatrix class. The function values were added but not used in the circle_300pts_density.csv, but a 0 was not added for circle_300pts_nofunction.csv. I fixed that and it gave me the same output.

I figured this out by printing out the actual distance matrix it was using during computation of the function values and found discrepancies. I believe this may also fix (?) the eccentricity issue mentioned by Mike. If not, an example of what is going wrong would help me debug that.

mlesnick commented 4 years ago

This new commit seems to fix the error with the eccentricity function. Thank you Anway.

On Sat, May 30, 2020 at 1:21 AM Anway De notifications@github.com wrote:

According to the data structures, when the tokens vector is filled up in data_reader.cpp, it expects either a function value associated with the point, or a 0 as an artificial birth value. This value is required, even if it is later ignored when new function values are computed in the DistanceMatrix class. The function values were added but not used in the circle_300pts_density.csv, but a 0 was not added for circle_300pts_nofunction.csv. I fixed that and it gave me the same output.

I figured this out by printing out the actual distance matrix it was using during computation of the function values and found discrepancies. I believe this may also fix (?) the eccentricity issue mentioned by Mike. If not, an example of what is going wrong would help me debug that.

— You are receiving this because you commented. Reply to this email directly, view it on GitHub https://github.com/rivetTDA/rivet/pull/149#issuecomment-636279136, or unsubscribe https://github.com/notifications/unsubscribe-auth/AC72KBHGWU23HRKEB4WQOSDRUCJWVANCNFSM4JVS3X2Q .

mlesnick commented 4 years ago

After looking again, it seems that there is still an issue with the eccentricity function. To see this, run a computation with network379_RIVET_Coeccentricity, using H0 and the user function (x-direction not reversed) Then run the same computaiton using the build in eccentricity function (keep the axis direction reversed, as per the default.)

Modulo change in the numerical bounds on the viewable region of the line selection window, one should get the same output. But that is not happening.

mlesnick commented 4 years ago

Another small issue with the file input GUI is illustrated in the screenshot below. Look how the numbers are placed too far to the left in the max distance box. This is for the file network379_RIVET_Coeccentricity. I did not see this issue with several other RIVET files I opened.

I'm guessing the issue is that the max distance is specified a larger precision than expected. (This max distance is one of the actual distances between points in the data set, and was chosen automatically by a script in the creation of the file.)

mlwright84 commented 4 years ago

In testing rivet_GUI, I have noticed two issues:

"user" function not selectable: When I load circle_300pts_density.csv, select file type points_fn, filtration function, the function option user is not active. The user option should be active whenever the file type is points_fn or metric_fn.
x-axis reversed for bifiltration and FIRep data: RIVET is improperly reversing the x-axis and also changing the x-axis label to "degree" for bifiltration and FIRep data. This results in errors in the persistence module and sometimes a crash.

mlesnick commented 4 years ago

I had an occasion to use the new branch to do a computation with the firep datatype, via the GUI. The computation did not work correctly.

In addition (perhaps relatedly), the file input dialogue did not work as expected: 1)The reverse option was not available even though it should be for this kind of data, I think. 2)When I specified the reverse manually, this option did not register in the GUI. I am attaching the FIRep file I used in the old format, which can be easily modified to the new format for debugging. (But when doing this, remember to include the flag --xreverse.) I suggest to run the example with 100x100 bins. It is a small example and will compute immediately.
Disk_Data_Min_Pres_Converted.txt

mlesnick commented 4 years ago

I have been using the pull request a a lot recently with the students in my TDA practicum course. This is going well. I feel this is a major improvement to the usability of RIVET. I have to say the file input dialogue is VERY convenient.

I see a some minor problems:

1)When using the ball density estimator, I occasionally notice the reverse button not being selected by default. It could be that this only happens when I load multiple files in the same RIVET session. (But for none of these am I deselecting the reverse button.)

2)When loading multiple files in the same session, RIVET doesn't refresh the Betti number selection boxes correctly. I'm not sure about the other boxes.

3)When given a square CSV matrix representing a distance matrix (very common scenario), RIVET assumes this is a point cloud. This is not an error, in the strictest sense, since in principle this could be a point cloud, but since that is very unlikely, that seems like undesirable behavior. Would it be possible to scan the first row and the columns of the data to see whether this data has the shape of a distance matrix, and set the default accordingly? If this is too much effort, then the message RIVET gives "This file appears to contain point cloud data" should be changed to "This file appears to contain point cloud or metric data."

ttlam01 commented 4 years ago

I'm using RIVET for my TDA course. I have some problems with using rivet_console in command-line with flag --function

flag "--function gaussiandensity" should be "--function gaussian"
omitting parameters for gaussian, balldensity and eccentricity causes errors.

hackerde commented 4 years ago

The modifications of this branch does not affect FIREP computations. However, defaults have been set for all datatype and IIRC, we decided that for FIREP data, "xreverse" is always false and cannot be changed. Even if it is specified in the input file or on the command line with the --xreverse, RIVET ignores it and so the behavior @mlesnick described is correct. Do let me know if this needs to be changed and I will make the necessary modifications.

Indeed, the option for the --function flag is gaussian and not gaussiandensity. I believe the documentation has to be updated to reflect that. If @mlwright84 and @mlesnick think that gaussiandensity is the desired option, then let me know and I will make the change.

@lamthanhtung01 When you omit parameters, you have to keep the brackets. For example: --function balldensity[] and not --function balldensity. I believe the documentation needs to be updated in that part as well. The square brackets are there to remind the user that there is an option for setting a parameter and they are voluntarily using the default.

3)When given a square CSV matrix representing a distance matrix (very common scenario), RIVET assumes this is a point cloud. This is not an error, in the strictest sense, since in principle this could be a point cloud, but since that is very unlikely, that seems like undesirable behavior. Would it be possible to scan the first row and the columns of the data to see whether this data has the shape of a distance matrix, and set the default accordingly? If this is too much effort, then the message RIVET gives "This file appears to contain point cloud data" should be changed to "This file appears to contain point cloud or metric data."

This is the desired behavior. The reason it says "This file appears to contain point cloud data" is to convey to the user that the file will be processed with that assumption and they can change the type in the datatype box. It is possible to scan the first row, but to scan the first column, the entire file needs to be read in. This is possible of course. However, changing to "This file appears to contain point cloud or metric data." creates an uncertainty as to what would happen if the user were to hit compute without changing anything in the dialog box.

@mlesnick I will look into the other two issues regarding GUI (hopefully) next week. Thank you for noticing and letting me know!

mlesnick commented 4 years ago

Thank you Anway,

The modifications of this branch does not affect FIREP computations. However, defaults have been set for all datatype and IIRC, we decided that for FIREP data, "xreverse" is always false and cannot be changed. Even if it is specified in the input file or on the command line with the --xreverse, RIVET ignores it and so the behavior @mlesnick described is correct. Do let me know if this needs to be changed and I will make the necessary modifications.

I see. This convention is problematic. We should have the flexibility to reverse either of the coordinate directions. The old version of RIVET allows for this and this is currently being used in another project.

Indeed, the option for the --function flag is gaussian and not gaussiandensity. I believe the documentation has to be updated to reflect that. If @mlwright84 and @mlesnick think that gaussiandensity is the desired option, then let me know and I will make the change.

I don't remember how the earlier discussions on this went; I guess I would now vote for "--gaussian" in spite of the inconsistency with "--balldensity." EDIT: I spoke to Matthew and he agrees that we should go with "--gaussian." We will fix the documentation.

@lamthanhtung01 When you omit parameters, you have to keep the brackets. For example: --function balldensity[] and not --function balldensity. I believe the documentation needs to be updated in that part as well. The square brackets are there to remind the user that there is an option for setting a parameter and they are voluntarily using the default.

3)When given a square CSV matrix representing a distance matrix (very common scenario), RIVET assumes this is a point cloud. This is not an error, in the strictest sense, since in principle this could be a point cloud, but since that is very unlikely, that seems like undesirable behavior. Would it be possible to scan the first row and the columns of the data to see whether this data has the shape of a distance matrix, and set the default accordingly? If this is too much effort, then the message RIVET gives "This file appears to contain point cloud data" should be changed to "This file appears to contain point cloud or metric data."

This is the desired behavior. The reason it says "This file appears to contain point cloud data" is to convey to the user that the file will be processed with that assumption and they can change the type in the datatype box.

This may be the behavior that we agreed on, but after seeing it in action, I feel that it is not a good design to display this message. It is sure to cause confusion, and is not consistent with the list of available file type options below in the GUI.

It is possible to scan the first row, but to scan the first column, the entire file needs to be read in. This is possible of course.

I see. I guess a solution which avoids this is preferable in general, since the files can be large.

However, changing to "This file appears to contain point cloud or metric data." creates an uncertainty as to what would happen if the user were to hit compute without changing anything in the dialog box.

Not sure I agree, since the default choice for file type is displayed clearly in the selector below. Still, perhaps there is a better way to address this than what I proposed. Would it be better in this case to instead have a message along the lines of "Specify the data type via the selector below."?

Related to this, should "file type" be changed to "data type" in the GUI? (There seems to be an inconsistency in nomenclature now.)

@mlesnick I will look into the other two issues regarding GUI (hopefully) next week. Thank you for noticing and letting me know!

Thank you!

mlesnick commented 4 years ago

While writing the last message, I noticed that the help text for rivet_console could use some polish; just flagging this for now so we don't forget about it before merging.

hackerde commented 3 years ago

1)When using the ball density estimator, I occasionally notice the reverse button not being selected by default. It could be that this only happens when I load multiple files in the same RIVET session. (But for none of these am I deselecting the reverse button.)

2)When loading multiple files in the same session, RIVET doesn't refresh the Betti number selection boxes correctly. I'm not sure about the other boxes.

3)When given a square CSV matrix representing a distance matrix (very common scenario), RIVET assumes this is a point cloud. This is not an error, in the strictest sense, since in principle this could be a point cloud, but since that is very unlikely, that seems like undesirable behavior. Would it be possible to scan the first row and the columns of the data to see whether this data has the shape of a distance matrix, and set the default accordingly? If this is too much effort, then the message RIVET gives "This file appears to contain point cloud data" should be changed to "This file appears to contain point cloud or metric data."

I made the necessary changes to fix these issues. I also enabled the axis reversal options for FIRep data. For the display message when a datatype is not selected, I decided to go with "This file does not contain the datatype. Please select the appropriate type below." and not mention that it is CSV data since RIVET can take in both CSV and space separated now. Let me know if there are further issues or changes necessary.

mlesnick commented 3 years ago

Thank you Anway. I have done a brief test of your most recent update. As far as I can tell, all the fixes work well. In the line "This file does not contain the datatype," I have changed "contain" to "specify", which I think is slightly more precise. In the next days, Matthew plans to fix the minor issues in the documentation that you identified, and after this, I believe all the issues we flagged will have been resolved. At that point, we can do a last sweep of the documentation for typos and light polish, then merge.