Closed jbburt closed 4 years ago
Hey! A few notes and thoughts:
First, get_surface_distance() will only be "perfectly" backwards-compatible if use_wb=True is the default (and even then, there might be some rounding differences due to floating-point imprecision). As is, with use_wb=False as the default, the biggest difference is that the current calculation method under-estimates the distance between indirect edges. When you're comparing parcel-parcel distances this amounts to very small absolute differences (<0.5 mm, in my experience), but when you're looking at vertex-vertex distances it can be more noticeable.
It might be worth making use_wb=True the default for now, or I can do some more brainstorming on the math to address that issue. I'll open a new issue describing the exact problems with the indirect edge distance calculation to see if it's something that can be readily tackled. (I only ever used the parcel-parcel distance matrix and was willing to take the minor imprecision for the speed-ups.)
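For anyone who wants to quantify that difference on their own data, here is a minimal sketch of how two distance matrices of the same shape could be compared. The helper name compare_distance_matrices and the toy values are hypothetical and are not part of brainsmash; in practice the two inputs would be the matrices produced with use_wb=True and use_wb=False.

```python
import numpy as np


def compare_distance_matrices(d_ref, d_test):
    """Summarize element-wise differences between two distance matrices.

    Hypothetical helper (not part of brainsmash). `d_ref` is treated as the
    reference; a negative difference means `d_test` under-estimates it.
    """
    d_ref = np.asarray(d_ref, dtype=float)
    d_test = np.asarray(d_test, dtype=float)
    if d_ref.shape != d_test.shape:
        raise ValueError("matrices must have the same shape")
    diff = d_test - d_ref
    return {
        "max_abs_diff": float(np.abs(diff).max()),
        "mean_diff": float(diff.mean()),
        "frac_underestimated": float((diff < 0).mean()),
    }


# Toy values for illustration only, not real output from either method.
d_wb = np.array([[0.0, 2.0, 4.0],
                 [2.0, 0.0, 3.0],
                 [4.0, 3.0, 0.0]])
d_fast = np.array([[0.0, 2.0, 3.8],
                   [2.0, 0.0, 3.0],
                   [3.8, 3.0, 0.0]])

summary = compare_distance_matrices(d_wb, d_fast)
print(summary)
```

Running the same summary on both the vertex-vertex and the parcel-parcel matrices would show how much of the discrepancy averages out after parcellation.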
The re-formatting of parcellate() and subcortex() is kind of tied in to your original rationale for maintaining them as separate entities. I totally get the logic that parcellated is a derivative of dense, but I'm not sure whether re-calculating the dense vertex-vertex distances is actually slower than re-loading the dense vertex-vertex distances from disk! I did a little test to compare:
In [1]: from brainsmash.workbench.geo import get_surface_distance, parcellate

In [2]: %time parcellate('lh_geodesic_dist.txt',
                         dlabel_file='Gordon333.L.32k_fs_LR.label.gii',
                         outfile='lh_gordon_dist_1.txt')
CPU times: user 13min 44s, sys: 4min 58s, total: 18min 43s
Wall time: 18min 43s

In [3]: %time get_surface_distance('S1200.L.midthickness_MSMAll.32k_fs_LR.surf.gii',
                                   outfile='lh_gordon_dist_2.txt',
                                   dlabel='Gordon333.L.32k_fs_LR.label.gii')
CPU times: user 9min 41s, sys: 59.1 ms, total: 9min 41s
Wall time: 9min 41s
where lh_geodesic_dist.txt in the first call to parcellate() is just the vertex-vertex distance matrix output by get_surface_distance() without a dlabel file. (I just happened to have Gordon on hand, but I assume the estimates would be similar for the MMP.)
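To make the "parcellated is a derivative of dense" relationship concrete, here is a minimal NumPy sketch of reducing a dense vertex-vertex matrix to a parcel-parcel matrix. It assumes each parcel-parcel entry is the mean distance over all vertex pairs drawn from the two parcels and that label 0 marks unassigned vertices; both are assumptions for illustration, and parcellate_dense is a hypothetical name, not the brainsmash implementation.

```python
import numpy as np


def parcellate_dense(dist, labels, background=0):
    """Average a dense vertex-vertex distance matrix into parcel-parcel form.

    Hypothetical sketch: entry [i, j] is the mean distance over all pairs
    (v, w) with v in parcel i and w in parcel j. Diagonal entries therefore
    include the zero self-distances of each vertex.
    """
    dist = np.asarray(dist, dtype=float)
    labels = np.asarray(labels)
    parcels = np.unique(labels[labels != background])
    # membership[k, v] is 1.0 when vertex v belongs to parcel k, else 0.0
    membership = (labels[None, :] == parcels[:, None]).astype(float)
    counts = membership.sum(axis=1)
    # Sum of distances over every vertex pair for each pair of parcels
    sums = membership @ dist @ membership.T
    return sums / np.outer(counts, counts)


# Five toy "vertices" on a line; the last one is unlabeled (background).
positions = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
dist = np.abs(positions[:, None] - positions[None, :])
labels = np.array([1, 1, 2, 2, 0])

parcel_dist = parcellate_dense(dist, labels)
print(parcel_dist)
```

For this toy input the between-parcel entry is 2.0 (the mean of 2, 3, 1 and 2) and each diagonal entry is 0.5. The reduction itself is cheap; the timings above suggest the cost is dominated by obtaining the dense matrix, whether by computing it or by reading it from disk.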
It seems like the disk I/O is generally the slower beast, and recomputing the whole vertex-vertex distance matrix (assuming you're doing it in-memory with get_surface_distance()) is actually faster than loading the matrix in from disk. For comparison, running get_surface_distance() without providing a dlabel file has a wall time of 23 minutes; the increase over the roughly 10 minutes above is almost entirely attributable to the I/O. (Note: these times will very likely vary by hardware!) I think if you wanted to consider alternative formats for the vertex-vertex distance matrix the disk I/O might see some speed-ups, but since you were also originally worried about memory constraints and loading the whole matrix in-memory (which is a very important consideration!), that makes things a bit trickier.
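As one illustration of an alternative format, here is a small sketch that writes the same toy matrix as text and as a binary .npy file, then reads the binary file back memory-mapped so that only the requested rows are pulled into memory. The file names and matrix size are made up, and whether the brainsmash functions discussed here accept .npy input is not established in this thread; this only shows the NumPy side.

```python
import os
import tempfile

import numpy as np

# Toy dense Euclidean distance matrix (200 points), standing in for the
# much larger vertex-vertex matrix discussed above.
rng = np.random.default_rng(0)
points = rng.random((200, 3))
dist = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)

tmpdir = tempfile.mkdtemp()
txt_path = os.path.join(tmpdir, "dist.txt")  # hypothetical file names
npy_path = os.path.join(tmpdir, "dist.npy")

# Text format: every value is formatted on write and parsed on read.
np.savetxt(txt_path, dist)
from_txt = np.loadtxt(txt_path)

# Binary format: raw bytes plus a small header, no parsing step.
np.save(npy_path, dist)

# Memory-mapped read: data stays on disk until a slice is accessed.
from_npy = np.load(npy_path, mmap_mode="r")
first_rows = np.asarray(from_npy[:10])  # only these rows are materialized

print(from_txt.shape, first_rows.shape)
```

Wrapping the two load calls in %time, as in the test above, would show how much of the read cost comes from text parsing on a given machine.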
All told, I'm not sure there's a way to maintain perfect API backwards compatibility and also maximize the speed-ups. If you're comfortable replacing cortex() with get_surface_distance() and then making the API of subcortex() match, I think that would make a lot of sense. As for parcellate(): you can leave it as is (so that it doesn't disappear from the API) and just suggest that, if people want, they can use the dlabel parameter of the new cortex() (née get_surface_distance()) and to-be-modified subcortex() functions to generate parcellated distance matrices? Everyone who has vertex-vertex distance matrices saved to disk that they want to use can then still rely on parcellate() to get the job done.
@rmarkello -- thanks so much for the insightful answer. I tried to implement everything as you suggested. Everything seems to be working fine but I've yet to thoroughly stress test it so let me know if you spot anything that seems off!
@rmarkello -- do you have thoughts on this?
My inclination is to replace my cortex function with your get_surface_distance function (I see you also wrote it in such a way that renaming get_surface_distance to cortex would preserve backwards compatibility -- nice!).
Then, parcellate and subcortex could either be left as-is, or subcortex could be revised such that it has (approximately) parallel structure with cortex. I think the latter would be pretty straightforward to implement using _get_euclid_distance and _get_parcel_distance, right?
One final thought -- in my original implementation I had functionally modularized the distance-computation and parcellation routines, in large part because i) parcellated is a derivative of dense, and ii) (I assumed?) it would take far longer to recompute the dense vertex-vertex distances and then parcellate than it would to read vertex-vertex distances from disk and then parcellate. Was my assumption wrong (or in your implementation, is this a non-issue)?