How best to update geo module?

jbburt commented 4 years ago

@rmarkello -- do you have thoughts on this?

my inclination is to replace my cortex function with your get_surface_distance function (I see you also wrote it in such a way that renaming get_surface_distance to cortex would preserve backwards compatibility -- nice!).

then, parcellate and subcortex could either be left as-is, or subcortex could be revised such that it has (approximately) parallel structure with cortex. I think the latter would be pretty straightforward to implement using _get_euclid_distance and _get_parcel_distance, right?

one final thought -- in my original implementation I had functionally modularized the distance-computation and parcellation routines, in large part because i) parcellated is a derivative of dense, and ii) because (I assumed?) it would take far longer to recompute the dense vertex-vertex distances and then parcellate, than it would to read vertex-vertex distances from disk and then parcellate. Was my assumption wrong (or in your implementation, is this a non-issue)?

rmarkello commented 4 years ago

Hey! A few notes and thoughts:

First, get_surface_distance() will only be "perfectly" backwards-compatible if use_wb=True is set to be the default (and even then, there might be some rounding errors due to floating point imprecision). As is, with use_wb=False as the default, the biggest issue / difference is that the current calculation method under-estimates the distance between indirect edges. When you're comparing parcel-parcel distances this ends up amounting to very small absolute differences (<0.5mm, in my experience), but when you're looking at vertex-vertex distance it can be more noticeable.

It might be worth changing it so, for now, use_wb=True is the default—or I can do some more brainstorming on the math to address that issue. I'll make a new issue with the exact problems with the indirect edge distance calculation to see if it's something that can be readily tackled (I only ever used the parcel-parcel distance matrix and was willing to take the minor imprecision for the speed-ups.)

The re-formatting of parcellate() and subcortex() is kind of tied in to your original rationale for maintaining them as separate entities. I totally get the logic that parcellated is a derivative of dense, but I'm not sure whether re-calculating the dense vertex-vertex distances is actually slower than re-loading the dense vertex-vertex distances from disk! I did a little test to compare:

In [1]: from brainsmash.workbench.geo import get_surface_distance, parcellate

In [2]: %time parcellate('lh_geodesic_dist.txt', 
                         dlabel_file='Gordon333.L.32k_fs_LR.label.gii', 
                         outfile='lh_gordon_dist_1.txt')
CPU times: user 13min 44s, sys: 4min 58s, total: 18min 43s
Wall time: 18min 43s

In [3]: %time get_surface_distance('S1200.L.midthickness_MSMAll.32k_fs_LR.surf.gii',
                                   outfile='lh_gordon_dist_2.txt', 
                                   dlabel='Gordon333.L.32k_fs_LR.label.gii')
CPU times: user 9min 41s, sys: 59.1 ms, total: 9min 41s
Wall time: 9min 41s

where lh_geodesic_dist.txt in the first call to parcellate() is just the vertex-vertex distance matrix output by get_surface_distance() without a dlabel file. (I just happened to have Gordon on hand, but I assume the estimates would be similar for the MMP.)

It seems like the disk I/O is generally the slower beast and actually recomputing the whole vertex-vertex distance matrix—assuming you're doing it in-memory with get_surface_distance()—is faster than loading the matrix in from disk. For comparison, running get_surface_distance() without providing a dlabel file has a wall time of 23 minutes; the increase from above is almost entirely attributable to the I/O. (Note: these times will very likely vary by hardware!) I think if you wanted to consider alternative formats for the vertex-vertex distance matrix the disk I/O might see some speed-ups, but since you were also originally worried about memory constraints and loading the whole matrix in-memory (which is a very important consideration!) that makes things a bit trickier.

All-told, I'm not sure there's a perfect way to maintain perfect API backwards compatibility and maximize the speed-ups. If you're comfortable replacing cortex() with get_surface_distance() and then making the API of subcortex() match I think that would make a lot of sense. As for parcellate(): you can leave it as is (so that it doesn't disappear from the API) and just suggest that, if people want, they can use the dlabel parameter of the new cortex() (née get_surface_distance()) and to-be-modified subcortex() functions to generate parcellated distance matrices? Everyone who has vertex-vertex distance matrices saved to disk that they want to use can then still rely on parcellate() to get the job done.

jbburt commented 4 years ago

@rmarkello -- thanks so much for the insightful answer. I tried to implement everything as you suggested. Everything seems to be working fine but I've yet to thoroughly stress test it so let me know if you spot anything that seems off!

murraylab / brainsmash

How best to update geo module? #6