Connectome mapping execution duration

sina-mansour / UKB-connectomics

This repository will host scripts used to map structural and functional brain connectivity matrices for the UK biobank dataset.

https://www.biorxiv.org/content/10.1101/2023.03.10.532036v1

61 stars 7 forks source link

Connectome mapping execution duration #21

Closed sina-mansour closed 2 years ago

sina-mansour commented 2 years ago

I have started testing the runtime for a more practical number of total streamlines/seeds. I used 10 million seeds (-seeds 10M) which in turn resulted in ~2.2M streamlines remaining after ACT.

Here is a short summary of the computation times:

Preprocessing steps, FOD estimation, MSMT CSD: ~30mins Tractography + sift weight calculations: ~2h30mins Mapping connectivity matrices: ~2h30mins

I personally think the last step is where we could act smarter to save a great deal of time. Let me first explain why this is taking so long. Basically, it takes about 10 seconds to run tck2connectome once (for a single atlas, and a single metric). However, the current code is mapping connectivity on 92 alternative atlases (23 cortical x 4 subcortical, see #20) and connectomes are being mapped for 10 alternative metrics (streamline count, FBC, length, MD, etc., see #10. Hence it takes a total of 10 92 10 seconds = ~ 150 mins.

Now I think there are multiple ways we could save time:

For instance, we can opt to select a subsample of all atlases available. If we only use 2 resolutions of the subcortex atlas (scale 1 and 4) and use only one-half of Schaeffer's atlases (only the ones defined for the 7 functional networks) that will result in about 75% reduction in execution time.
Currently, tck2connectome is being called 920 times, every time with a different atlas and metric. However, there's a great deal of redundancy in reading 2M streamlines every single time. If we could possibly add this functionality to tck2connectome to only read the streamlines once, and map multiple connectomes using different choices of atlases and metrics every time that could be a great addition to MRtrix which could both benefit this project and future users. I could alternatively write some python code that does the same but without tck2connectome.
There's also a third option that I personally am in favor of more than the first two: we can alternatively provide the following: (i) metrics (length, sift weights, mean FA, etc.) sampled for every streamline (ii) a resampled tractogram that only contains the endpoints of every streamline, and (iii) the atlases used. With these sets of files, the user can compute any of the connectivity matrices using tck2connectome in ~10 seconds. We could optionally provide a single connectivity map for a single atlas combination that we deem to be the best. This way, those less familiar with connectivity analyses get a connectivity matrix to use, and the experts can get to pick the metric & atlas to map their desired connectivity matrics.

This last option would reduce the connectivity mapping procedure to a couple of minutes at most. Furthermore, this would also reduce file storage requirements to save everything. (currently, the 920 connectomes mapped for every subject is around 800MB, which could be reduced to ~80 MB for all atlases + ~80 MB for the streamline endpoints + ~20MB for connectomes on a single atlas)

(This alternative would also enable mapping high-resolution connectomes using endpoint information if we wanted to do that in a future project.)

Please let me know what you think is the optimal solution in your opinion.

AndrewZalesky commented 2 years ago

I think that we can consider a compromise solution that strikes a balance between options 1-3.

Two resolutions of the subcortex atlas is definitely adequate and using only have of the Schaeffer's atlases sounds reasonable.
Rather than only providing one matrix and streamline endpoint, what about we consider a compromise and provide say 5 matrices + endpoints.
My only worry with providing streamline endpoints is that some users might not know how to generate a matrix based on endpoints. Or they simply don't want to go to through the hassle of installing software to do the computation. So I think that we should cater for users wanting to download connectivity matrices and thus provide 5 or so parcellation options.
I am surprised that the storage size of all endpoints is equal to the storage size of a connectivity matrix? Let's say we have a 200 node parcellation and a connectivity density of 50%, this would require storage of 9950 values. Whereas if we have 1 million valid streamlines, we would need to store 10^6 x 3 x 2 = 6 million values.

On 13/02/2022 9:17 pm, Sina Mansour L. wrote:

I have started testing the runtime for a more practical number of total streamlines/seeds. I used 10 million seeds (|-seeds 10M|) which in turn resulted in ~2.2M streamlines remaining after ACT.

Here is a short summary of the computation times:

Preprocessing steps, FOD estimation, MSMT CSD: ~30mins Tractography + sift weight calculations: ~2h30mins Mapping connectivity matrices: ~2h30mins

I personally think the last step is where we could act smarter to save a great deal of time. Let me first explain why this is taking so long. Basically, it takes about 10 seconds to run tck2connectome once (for a single atlas, and a single metric). However, the current code is mapping connectivity on 92 alternative atlases (23 cortical x 4 subcortical, see #20 https://github.com/sina-mansour/UKB-connectomics/issues/20) and connectomes are being mapped for 10 alternative metrics (streamline count, FBC, length, MD, etc., see #10 https://github.com/sina-mansour/UKB-connectomics/issues/10. Hence it takes a total of 10 92 10 seconds = ~ 150 mins.

Now I think there are multiple ways we could save time:

1.
For instance, we can opt to select a subsample of all atlases
available. If we only use 2 resolutions of the subcortex atlas
(scale 1 and 4) and use only one-half of Schaeffer's atlases (only
the ones defined for the 7 functional networks) that will result
in about 75% reduction in execution time.
2.
Currently, tck2connectome is being called 920 times, every time
with a different atlas and metric. However, there's a great deal
of redundancy in reading 2M streamlines every single time. If we
could possibly add this functionality to tck2connectome to only
read the streamlines once, and map multiple connectomes using
different choices of atlases and metrics every time that could be
a great addition to MRtrix which could both benefit this project
and future users. I could alternatively write some python code
that does the same but without tck2connectome.
3.
There's also a third option that I personally am in favor of more
than the first two: we can alternatively provide the following:
(i) metrics (length, sift weights, mean FA, etc.) sampled for
every streamline (ii) a resampled tractogram that only contains
the endpoints of every streamline, and (iii) the atlases used.
With these sets of files, the user can compute any of the
connectivity matrices using tck2connectome in ~10 seconds. We
could optionally provide a single connectivity map for a single
atlas combination that we deem to be the best. This way, those
less familiar with connectivity analyses get a connectivity matrix
to use, and the experts can get to pick the metric & atlas to map
their desired connectivity matrics.
This last option would reduce the connectivity mapping procedure to a couple of minutes at most. Furthermore, this would also reduce file storage requirements to save everything. (currently, the 920 connectomes mapped for every subject is around 800MB, which could be reduced to ~80 MB for all atlases + ~80 MB for the streamline endpoints + ~20MB for connectomes on a single atlas)

(This alternative would also enable mapping high-resolution connectomes using endpoint information if we wanted to do that in a future project.)

Please let me know what you think is the optimal solution in your opinion.

— Reply to this email directly, view it on GitHub https://github.com/sina-mansour/UKB-connectomics/issues/21, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANXR7AXGK2MOINL7SSFQNKTU26AK5ANCNFSM5OIV5SRQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you are subscribed to this thread.Message ID: @.***>

Lestropie commented 2 years ago

Useful to see the empirical times; I was concerned from the outset that this contribution to runtime may be too much, but couldn't justify investing the development time before it was proven to be necessary.

I would personally leave this as a last-resort option; either of the other two options provide end-users with greater possibilities, and therefore make it less likely that another group may deem our contribution inadequate and therefore essentially repeat this project in order to get the data they want.
I'd already considered the prospect of augmenting tck2connectome to receive a 4D parcellation image with one independent parcellation per volume. This would reduce the number of calls by the number of unique parcellations. It would require a decent amount of coding effort, but it's certainly achievable. Doing the same with contrasts is however slightly trickier. Theoretically I could have it also accept a matrix input for the per-streamline factors, and it would e.g. generate a matrix per column. However the fact that different matrices require different edge-wise statistics makes it difficult to collapse the whole thing into a single tck2connectome invocation; it'd need to be at least 2.

The development effort here would likely be quite a bit more for me to get it done in MRtrix3 compared to yourself "just getting it done" in Python; but it would be computationally faster and more general. I'll try to have a look at the relevant code at some point and make a guesstimate as to how long it might take me.
Even if one can compute the connectivity matrix from a parcellation image & tractogram containing only endpoints & quantitative value per streamline, it nevertheless introduces MRtrix3 as a requisite into the pipeline for users who may have a tailored interest in network analysis and have never used MRtrix3. So even if tractogram endpoint data are provided, I'm not sure that it is an adequate substitute for the pre-calculation of matrices.

There is however a strategy whereby I can foresee this working. We could provide a dedicated tool for performing the final step of connectome generation based on the provided pre-generated data. This would essentially be a container that would wrap the tck2connectome functionality, based on the filesystem naming scheme in which our pre-generated data are provided.

The user would specify:
- Cortical parcellation;
- Sub-cortical parcellation;
- Quantitative value per streamline;
- Inclusion or exclusion of SIFT2 weights;
- Edge-wise statistical metric (e.g. "mean" for mean length per edge"; "sum" for streamline count / FBC)
and the tool would yield the corresponding connectivity matrix.

This has a couple of potential advantages:
- Reduce storage requirements at the UKB end;
- Potentially extensible; i.e. new parcellations could potentially be added after the fact.
- Potentially use superior streamline-to-node assignment mechanism in the pipeline (though this is probably less relevant for your data)
- Potentially add capability for computing median per edge, which is not yet implemented;
- Could later augment with CSS (#22), with options configurable by the end-user rather than having only a finite set of parameters chosen by us.

Across these I think my preference would actually be for option 3. I have a little experience now with container construction so can hopefully point you in the right direction.

sina-mansour commented 2 years ago

I am surprised that the storage size of all endpoints is equal to the storage size of a connectivity matrix? Let's say we have a 200 node parcellation and a connectivity density of 50%, this would require storage of 9950 values. Whereas if we have 1 million valid streamlines, we would need to store 10^6 x 3 x 2 = 6 million values.

It's actually the multiple forks that increase the size, here's a detailed breakdown: For atlases in smaller resolutions (such as the combination of 100 cortical parcels from Schaeffer and the coarsest subcortical atlas) a single connectome would only be 26kB. For higher resolution atlases, this goes up to 2MB for a single connectome (~1000 nodes). On average a single connectivity matrix will be around 1MB.

However, we are mapping 10 different measures and 92 different atlases which eventually sum up to ~800MB.

Whereas a single file containing all endpoints is ~80MB which is 4000 times larger than the smallest connectome size.

My only worry with providing streamline endpoints is that some users might not know how to generate a matrix based on endpoints. Or they simply don't want to go to through the hassle of installing software to do the computation. So I think that we should cater for users wanting to download connectivity matrices and thus provide 5 or so parcellation options.

I agree that we should provide matrices for users looking for prepared connectivity matrices. If we want to provide multiple parcellations, we can multiple lower resolution parcellations commonly used and a single one in higher resolution:

DK atlas + subcortex scale one (200kB)
Destrieux atlas + subcortex scale one (800kB)
Glasser atlas + subcortex scale one (3MB)
Glasser atlas + subcortex scale four (4MB)
Schaefer 500 nodes atlas + subcortex scale four (8MB)

Essentially the coarser atlases are a lot smaller. While using these coarse atlases solves the storage problem, it will still take considerable time to compute the connectomes (~10mins for 5 atlases and 10 measures) which we may need to optimize for.

Mapping the 5 connectomes mentioned above + providing endpoints and streamline metrics would take a total of <400MB (before compression):

~ 15MB for 50 connectivity matrices on 5 atlases ~ 80MB for endpoints ~ 80MB for 92 atlas combinations ~ 250MB for streamline metrics

In terms of saving storage, we may also want to reduce the number of metrics we map as currently we're mapping a variety of DTI metrics that are not that widely used in the context of connectivity matrices (FA, MD, MO, ICVF, ISOVF, OD, SO. see #10)

AndrewZalesky commented 2 years ago

Thanks for the clarification Sina.

I really question whether we need to provide matrices for MD, MO, ICVF, ISOVF, OD and SO. I mean how many studies have actually mapped connectomes using these measures. I would imagine that we would be catering for a very small number of potential users.

If the others agree, of the micro-structural measures, I would only consider FA.

I agree with 5 options that you propose.

On 14/02/2022 12:12 pm, Sina Mansour L. wrote:

  * I am surprised that the storage size of all endpoints is equal
    to the storage size of a connectivity matrix? Let's say we
    have a 200 node parcellation and a connectivity density of
    50%, this would require storage of 9950 values. Whereas if we
    have 1 million valid streamlines, we would need to store 10^6
    x 3 x 2 = 6 million values.
It's actually the multiple forks that increase the size, here's a detailed breakdown: For atlases in smaller resolutions (such as the combination of 100 cortical parcels from Schaeffer and the coarsest subcortical atlas) a single connectome would only be 26kB. For higher resolution atlases, this goes up to 2MB for a single connectome (~1000 nodes). On average a single connectivity matrix will be around 1MB.

However, we are mapping 10 different measures and 92 different atlases which eventually sum up to ~800MB.

Whereas a single file containing all endpoints is ~80MB which is 4000 times larger than the smallest connectome size.
  * My only worry with providing streamline endpoints is that some
    users might not know how to generate a matrix based on
    endpoints. Or they simply don't want to go to through the
    hassle of installing software to do the computation. So I
    think that we should cater for users wanting to download
    connectivity matrices and thus provide 5 or so parcellation
    options.
I agree that we should provide matrices for users looking for prepared connectivity matrices. If we want to provide multiple parcellations, we can multiple lower resolution parcellations commonly used and a single one in higher resolution:

DK atlas + subcortex scale one (200kB)

Destrieux atlas + subcortex scale one (800kB)

Glasser atlas + subcortex scale one (3MB)

Glasser atlas + subcortex scale four (4MB)

Schaefer 500 nodes atlas + subcortex scale four (8MB)

Essentially the coarser atlases are a lot smaller. While using these coarse atlases solves the storage problem, it will still take considerable time to compute the connectomes (~10mins for 5 atlases and 10 measures) which we may need to optimize for.

Mapping the 5 connectomes mentioned above + providing endpoints and streamline metrics would take a total of <400MB (before compression):

~ 15MB for 50 connectivity matrices on 5 atlases ~ 80MB for endpoints ~ 80MB for 92 atlas combinations ~ 250MB for streamline metrics

In terms of saving storage, we may also want to reduce the number of metrics we map as currently we're mapping a variety of DTI metrics that are not that widely used in the context of connectivity matrices (FA, MD, MO, ICVF, ISOVF, OD, SO. see #10 https://github.com/sina-mansour/UKB-connectomics/issues/10)

— Reply to this email directly, view it on GitHub https://github.com/sina-mansour/UKB-connectomics/issues/21#issuecomment-1038520229, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANXR7AR6HLHVVYW4BOSUPY3U3BJJNANCNFSM5OIV5SRQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

sina-mansour commented 2 years ago

Quantitative value per streamline;

Hi Rob,

Regarding the quantitative values, I wasn't able to find a way to compute length for every single streamline. I think that could be an option for tcksample? Currently, length is computed at the stage of connectivity mapping (tck2connectome with -scale_length) which could also be another part of the code we could save time on.

I'm also in favor of providing a very minimal container along with the code to enable easy connectivity matrix generation. This could definitely be an addition to the few (5 atlases) connectivity matrices that are directly provided and as you mentioned this gives a good degree of freedom to the user to choose different options (atlas, metric, smoothing, statistic).

Lestropie commented 2 years ago

tckstats -dump will give a length per streamline. That functionality doesn't make sense in tcksample because it's not "sampling" anything. (We've had many interesting discussions over the years regarding what functionalities should be provided where and with what name...)

Lestropie commented 2 years ago

Let's say we have a 200 node parcellation and a connectivity density of 50%, this would require storage of 9950 values.

For matrices that are explicitly stored as dense, zero values can't be "omitted"; they will still require as much data as what is required to store that zero (e.g. 2 bytes if "0,", more if e.g. "0.00000"). Storing generated matrices in a binary rather than ASCII format would also reduce storage space, akin to the other forms of data discussed in #23.

Indeed even for matrices explicitly stored using a sparse format, the locations of the non-zero elements has to be encoded somehow, and when you get to ~ 50% density there ends up being no benefit (or indeed even a detriment) to using a sparse format.

I agree that we should provide matrices for users looking for prepared connectivity matrices.

This does however introduce extra layers of complexity: both our decision of which matrices to compute & upload vs. what not to, as well as at the user end whether the particular matrix configuration they are interested in is provided or whether they need to use a tool to generate it. If such a hypothetical tool as per option 3 were to be sufficiently user-friendly and accessible, I would argue that not providing any matrices and deferring everything to that tool would actually be simpler. But I don't know what scope there is for e.g. integrating a computational tool into the UKB interface such that connectomes are generated on request rather than the user having to download all of the subject's data and then compute the matrix of interest locally.

~ 250MB for streamline metrics

This could be reduced by ~ half as per #23.

AndrewZalesky commented 2 years ago

I don't agree Rob. In my experience sparse formats can still offer significant benefits at 50%.

For dense storage, we need to store 8 bytes per entry (assuming double precision), which yields 200 x 200 x 8 = 320,000 bytes
For sparse storage, we store 8 bytes per entry plus 4 bytes for the integer indexing the entry location (12 bytes total per entry), which yields 12 x 200 x 200 x 0.5 = 240,000

That's a 25% saving!

Additionally, note also that we only need to store the upper or lower triangle due to symmetry, and thus this saving would be even greater in practice. I.e. 50% density actually converts to storing 25% of non-zero values.

On 14/02/2022 12:54 pm, Robert Smith wrote:

Let's say we have a 200 node parcellation and a connectivity
density of 50%, this would require storage of 9950 values.
For matrices that are explicitly stored as dense, zero values can't be "omitted"; they will still require as much data as what is required to store that zero (e.g. 2 bytes if "|0,|", more if e.g. "|0.00000|"). Storing generated matrices in a binary rather than ASCII format would also reduce storage space, akin to the other forms of data discussed in #23 https://github.com/sina-mansour/UKB-connectomics/issues/23.

Indeed even for matrices explicitly stored using a sparse format, the locations of the non-zero elements has to be encoded somehow, and when you get to ~ 50% density there ends up being no benefit (or indeed even a detriment) to using a sparse format.
I agree that we should provide matrices for users looking for
prepared connectivity matrices.
This does however introduce extra layers of complexity: both our decision of which matrices to compute & upload vs. what not to, as well as at the user end whether the particular matrix configuration they are interested in is provided or whether they need to use a tool to generate it. If such a hypothetical tool as per option 3 were to be sufficiently user-friendly and accessible, I would argue that not providing any matrices and deferring everything to that tool would actually be simpler. But I don't know what scope there is for e.g. integrating a computational tool into the UKB interface such that connectomes are generated on request rather than the user having to download all of the subject's data and then compute the matrix of interest locally.
~ 250MB for streamline metrics
This could be reduced by ~ half as per #23 https://github.com/sina-mansour/UKB-connectomics/issues/23.

— Reply to this email directly, view it on GitHub https://github.com/sina-mansour/UKB-connectomics/issues/21#issuecomment-1038544003, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANXR7AS43FKZJTBI2K6OC7LU3BODTANCNFSM5OIV5SRQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

Lestropie commented 2 years ago

In my experience sparse formats can still offer significant benefits at 50%.

I am speaking naively on this; I'm familiar with some of the formats used for such on a technical level, but I've not actually utilised such data in practise, so I don't have the experience of where the benefits start to roll off. It also depends on the particular format used; different strategies are optimal for different manifestations of sparsity.

You are right in that the symmetry of the data adds another source of sparsity; I'd overlooked that.

Currently MRtrix3 just exports matrix data as text files. If you want to export data in binary format, sparse or otherwise, you could either go through a conversion process that you implement yourself, or I could code into MRtrix3 the ability to export data in an appropriate format. That aspect wouldn't take too much effort. But as per option 3 there's the question of whether we need to minimise storage space of such data vs. generating such data on request only.

AndrewZalesky commented 2 years ago

Perhaps we could define our own sparse file format?

I think that both options would be good:

A small number of connectivity matrices (e.g. the 5 that Sina defined)
Storage of end points + matlab/python function to convert endpoints to a connectivity matrix. I'd be happy to write the matlab function.

But we should wait for Caio's suggestions before going ahead!

On 14/02/2022 2:19 pm, Robert Smith wrote:

In my experience sparse formats can still offer significant
benefits at 50%.
I am speaking naively on this; I'm familiar with some of the formats used for such on a technical level, but I've not actually utilised such data in practise, so I don't have the experience of where the benefits start to roll off. It also depends on the particular format used; different strategies are optimal for different manifestations of sparsity.

You are right in that the symmetry of the data adds another source of sparsity; I'd overlooked that.

Currently /MRtrix3/ just exports matrix data as text files. If you want to export data in binary format, sparse or otherwise, you could either go through a conversion process that you implement yourself, or I could code into /MRtrix3/ the ability to export data in an appropriate format. That aspect wouldn't take too much effort. But as per option 3 there's the question of whether we need to minimise storage space of such data vs. generating such data on request only.

— Reply to this email directly, view it on GitHub https://github.com/sina-mansour/UKB-connectomics/issues/21#issuecomment-1038588264, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANXR7AXADRV7KQ3ET5HLTD3U3BYEJANCNFSM5OIV5SRQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

Lestropie commented 2 years ago

Don't see any benefit in proposing a custom sparse format; it's a well-explored space, the reason for doing it would be reducing space but having the capacity to convert to dense via established softwares e.g. Matlab / Octave / Python.

As far as software for taking streamline endpoints / parcellation image / quantitative value per streamline and generating from that a connectome, devising a distributable function for that purpose is indeed an option. In comparison:

Advantages of wrapping the tck2connectome functionality:
- Access to features currently within such; this includes the marginally-superior heuristic of radially searching around the streamline endpoint rather than just sampling that voxel (which may or may not be an issue in your data; depends on the nuances), fast execution including multithreading, and pre-existing capability to do different edge-wise statistics;
- Inherit future implemented features of that command (which would hopefully include CSS support).
Disadvantages:
- It's not quite as clean to distribute as a container rather than just a standalone script (though I'm personally not a fan of relying on Matlab for open-source scientific distribution given its licensing);
- If an alternative format for track data is used (#23) it'd either need to be converted to .tck or I'd need to implement tractography file format handlers (MRtrix3/mrtrix3#411).

Not gravitating strongly in any particular direction myself... the main question will be whether just accessing the parcellation image value within the voxel in which the streamline endpoint resides is reliable for your data. My suspicion is that, even if both the 5TT image and the parcellation image are driven by FreeSurfer, the fact that the former is interpolated whereas the latter is not may introduce some "misses".

sina-mansour commented 2 years ago

Update:

After reducing the number of atlases and metrics provided (the list below) connectivity matrices are mapped in ~5minutes, and take ~15MB per subject.

Atlases:

DK atlas + Subcortex S1 (lowest resolution)
Destrieux atlas + Subcortex S1
Glasser atlas + Subcortex S1
Glasser atlas + Subcortex S4
Schaeffer 200 atlas + Subcortex S1
Schaeffer 500 atlas + Subcortex S4
Schaeffer 1000 atlas + Subcortex S4 (highest resolution)

Metrics:

Streamline count
FBC (from SIFT2)
Mean Length
Mean FA

The next steps are:

Testing alternative file formats to save space storing endpoint information and per streamline metrics
Implementing a simple python script to replace the current tck2connectome script which provides similar functionality, but saves time required to map matrices by only reading the streamlines once.
Adding optional smoothing functionality

I'll keep you posted on the next steps...

AndrewZalesky commented 2 years ago

Sounds good Sina.

I wonder if we can develop the python script in parallel while the jobs are running because won't need the script for computing the 7 matrices and end point file. Happy to write a matlab version to provide users with two options.

On 19/02/2022 4:24 pm, Sina Mansour L. wrote:

Update:

After reducing the number of atlases and metrics provided (the list below) connectivity matrices are mapped in ~5minutes, and take ~15MB per subject.

Atlases:

DK atlas + Subcortex S1 (lowest resolution)

Destrieux atlas + Subcortex S1

Glasser atlas + Subcortex S1

Glasser atlas + Subcortex S4

Schaeffer 200 atlas + Subcortex S1

Schaeffer 500 atlas + Subcortex S4

Schaeffer 1000 atlas + Subcortex S4 (highest resolution)

Metrics:

Streamline count

FBC (from SIFT2)

Mean Length

Mean FA

The next steps are:

Testing alternative file formats to save space storing endpoint information and per streamline metrics

Implementing a simple python script to replace the current tck2connectome script which provides similar functionality, but saves time required to map matrices by only reading the streamlines once.

Adding optional smoothing functionality

I'll keep you posted on the next steps...

— Reply to this email directly, view it on GitHub https://github.com/sina-mansour/UKB-connectomics/issues/21#issuecomment-1045799586, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANXR7AXFXL43BOYQM33IUBDU34SPPANCNFSM5OIV5SRQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

sina-mansour commented 2 years ago

I wonder if we can develop the python script in parallel while the jobs are running because won't need the script for computing the 7 matrices and end point file.

Agreed, I think we can start running the jobs after finalizing file formats, then when the jobs are running, I'll work on the python script and also add CSS functionality. We could then have a second round (which would most likely be very quick) to map smooth connectomes for streamline count and FBC.

sina-mansour commented 2 years ago

Regarding the file format for endpoints:

The original MRtrix format takes ~80MB to store all endpoints (~2Million). This is not the most efficient storage format, as three nans are stored for every streamline to delineate between the start and end coordinate.

I have tried the following alternative formats thus far:

First tried a simple text file format (similar to storing a csv file). where every line contained 6 floats (delineated with a space character) indicating the coordinates (x,y,z in mm) of the endpoints. I tried 3 different precisions, and also checked whether compression made a difference:

3 digits floating precision ("%.3f"): ~100 MB before compression; 40MB after compression (.txt.gz)
2 digits floating precision ("%.2f"): ~85 MB before compression; 32MB after compression (.txt.gz)
1 digits floating precision ("%.1f"): ~71 MB before compression; 25MB after compression (.txt.gz)

I then tried saving it as a binary file. I used python's .npy format which could easily be read by either python or matlab. With this format, the endpoint information can be saved as a 3D array (N x 2 x 3) where N denotes the number of streamlines. Here is a report of the file sizes for the .npy files:

with 32bit float: ~50MB with 16bit float: ~25MB

My preference is toward the 16bit float option of the NPY format. It gives about 2 digits of floating precision, could be read in python or matlab, and can potentially be added to supported files by MRtrix.

Please let me know what your thoughts are.

I'll next test the same formats for per streamline metrics and report the results in this thread...

AndrewZalesky commented 2 years ago

Hi Sina,

I agree with you. I don't have any experience with half precision (16 bit) floats, but they seem widely supported. Apparently most GPUs work in 16 bits.

On 19/02/2022 5:53 pm, Sina Mansour L. wrote:

Regarding the file format for endpoints:

The original MRtrix format takes ~80MB to store all endpoints (~2Million). This is not the most efficient storage format, as three nans are stored for every streamline to delineate between the start and end coordinate.

I have tried the following alternative formats thus far:

First tried a simple text file format (similar to storing a csv file). where every line contained 6 floats (delineated with a space character) indicating the coordinates (x,y,z in mm) of the endpoints. I tried 3 different precisions, and also checked whether compression made a difference:

3 digits floating precision ("%.3f"): ~100 MB before compression; 40MB after compression (.txt.gz)

2 digits floating precision ("%.2f"): ~85 MB before compression; 32MB after compression (.txt.gz)

1 digits floating precision ("%.1f"): ~71 MB before compression; 25MB after compression (.txt.gz)

I then tried saving it as a binary file. I used python's .npy format which could easily be read by either python or matlab https://github.com/kwikteam/npy-matlab. With this format, the endpoint information can be saved as a 3D array (N x 2 x 3) where N denotes the number of streamlines. Here is a report of the file sizes for the .npy files:

with 32bit float: ~50MB with 16bit float: ~25MB

My preference is toward the 16bit float option of the NPY format. It gives about 2 digits of floating precision, could be read in python or matlab, and can potentially be added to supported files by MRtrix.

Please let me know what your thoughts are.

I'll next test the same formats for per streamline metrics and report the results in this thread...

— Reply to this email directly, view it on GitHub https://github.com/sina-mansour/UKB-connectomics/issues/21#issuecomment-1045904116, or unsubscribe https://github.com/notifications/unsubscribe-auth/ANXR7AUA2QHRLBU5S2L7AOLU34467ANCNFSM5OIV5SRQ. Triage notifications on the go with GitHub Mobile for iOS https://apps.apple.com/app/apple-store/id1477376905?ct=notification-email&mt=8&pt=524675 or Android https://play.google.com/store/apps/details?id=com.github.android&referrer=utm_campaign%3Dnotification-email%26utm_medium%3Demail%26utm_source%3Dgithub.

You are receiving this because you commented.Message ID: @.***>

sina-mansour commented 2 years ago

Here's an update regarding the per streamline file sizes.

For metrics other than streamline count, the streamline weights are stored in a metric file (SIFT2, length, FA, MD, MO, NODDI measures...).

Currently, mrtrix generated outputs are text files with space separated values. size of these files normally range from 20MB to 30MB.

I tried storing these as a vector using the NPY binary format. Here are the file sizes for different precisions:

64bit floats (double): ~20MB per metric
32bit floats: ~10MB per metric
16bit floats: ~5MB per metric

From what I see, the half precision floats (16bit) are enough to keep the meaningful information. For instance, here are some values taken from mean FA metric in double precision:

array([0.15744527, 0.43641067, 0.31822553, ..., 0.1769983 , 0.41859964, 0.15124135])

and here is the same information in half precision:

array([0.1575, 0.4365, 0.318 , ..., 0.177 , 0.4187, 0.1512], dtype=float16)

It keeps around 4 decimal digits which I think would be enough for our purposes.

To give you an overview of file sizes if we decide to use float16 precision:

streamline endpoints: 25MB
per streamline metrics (all metrics available): 9 x 5MB = 45MB

This sums to 70MB which is a lot smaller than the previous figure (~300MB).

Please let me know if you think using half precision (or the NPY format) may be problematic. Otherwise, we could use this format to store metrics and save storage space.

Lestropie commented 2 years ago

In addition to the precision limitations, float16 is also limited in its range: anything outside of ~ 0.0001 - 10000 becomes problematic. So would need to verify that none of the proposed metrics are outside of this; e.g. depending on units, mean diffusivity could be an issue, with values ~ 3x10^-6 for free water.

Using NPY for the streamline endpoints would presumably be a good choice also. Supporting this within MRtrix3 would however be a little tougher: in addition to requiring #411 (which I've been avoiding for years) it doesn't support irregular matrices, which is what would be required for general streamline storage.

The NPY format claims to support variable endianness, though it's unclear to me yet exactly how this is achieved.

I believe that "GPUs working in 16 bits" is primarily around ML; I think the tensor cores on the newest hardware are tailored for 4x4 float16 multiplications. They have opcodes for operating on half / single / double precision, but the speedup in terms of flops for lower-precision is insane.

For computational efficiency it would be preferable for the data to be immediately saved in this format by MRtrix3, rather than writing to a text file and then having to convert: MRtrix3/mrtrix3#2435.

Lestropie commented 2 years ago

@sina-mansour If you pull updates to the MRtrix3 ukb branch, binaries can now export directly to .npy format. In addition, if you set NPYFloatMaxSavePrecision to 16 in your MRtrix config file, or alternatively execute any relevant command with the additional command-line option -config NPYFloatMaxSavePrecision 16, it should save as 16-bit floating-point.

sina-mansour commented 2 years ago

@sina-mansour If you pull updates to the MRtrix3 ukb branch, binaries can now export directly to .npy format. In addition, if you set NPYFloatMaxSavePrecision to 16 in your MRtrix config file, or alternatively execute any relevant command with the additional command-line option -config NPYFloatMaxSavePrecision 16, it should save as 16-bit floating-point.

Thanks Rob, I'll give it a try today and use the new config function to directly export the binaries in .npy format with MRtrix.

The NPY format claims to support variable endianness, though it's unclear to me yet exactly how this is achieved.

I think numpy array objects (ndarray) generally support variable endianness (a few examples here).

it doesn't support irregular matrices, which is what would be required for general streamline storage.

Nibabel defines an ArraySequence object which is used when importing tractography data. This object also supports saving data az binaries in .npz format. I guess another alternative for general streamline storage (which doesn't require the Nibabel dependency) is to store the streamlines as a list of (N,3) ndarrays using numpy's .npz format which is used to store collection of arrays (see numpy.savez).

sina-mansour commented 2 years ago

In addition to the precision limitations, float16 is also limited in its range: anything outside of ~ 0.0001 - 10000 becomes problematic. So would need to verify that none of the proposed metrics are outside of this; e.g. depending on units, mean diffusivity could be an issue, with values ~ 3x10^-6 for free water.

I've tested the metrics (for a single subject) to ensure that the values would not be distorted after precision reduction (see below). Although MD was closer to values in which a float16 will lose validity it did not get below 0.0005; none of the metrics passed the limits (i.e. it would be safe to save all metrics as float16).

I've also generated the following plot which shows that float 16 is valid for values ranging from 1e-5 to 1e5. (The x axis shows the floating number and the y axis shows accuracy, i.e. 1 is ideal.

Lestropie commented 2 years ago

The NPY format claims to support variable endianness, though it's unclear to me yet exactly how this is achieved.

I think numpy array objects (ndarray) generally support variable endianness

Yeah, I figured it out once I started pursuing .npy format support for 1D & 2D regular matrices. Those should import data with either endianness, and in either column-major or row-major. If I pursue such support for tractography data, I would need to deal with it there also.

For generating the tractography endpoint data, a tailored command would be substantially less development effort. Just have a standalone command that reads in a track file, takes just the endpoints, and writes to a .npy. I could write something against the MRtrix3 C++ API or you could do this component yourself. Generalising MRtrix3 tractography data support to facilitate having tckresample -endpoints write to .npy or .npz would actually be a lot of work, and I'm conscious of the balance between computational optimisation and holding up the commencement of computations.

Although MD was closer to values in which a float16 will lose validity it did not get below 0.0005; none of the metrics passed the limits (i.e. it would be safe to save all metrics as float16).

S0 is also teetering.

A capability used in image data storage to address this kind of issue is to embed an offset and scaling factor in the image header. So e.g. one can store an FA image using an integer datatype, by having image intensities in the range 0-10000 and an item in the image header that instructs the reader to scale the image intensities by 0.0001 for correct interpretation. MRtrix3 supports this well for image data.

While the .npy format does not appear to have such a capability, you could nevertheless do similar yourself. E.g. multiply the S0 image by 0.001 prior to tcksample, and simply document that the values stored in that NPY correspond to (0.001 x S0). Do the same thing with MD with a 1000 multiplier. Just brings the stored values closer to the centre of the distribution (1.0) so that precision-limited numerical rounding is mitigated.

sina-mansour commented 2 years ago

While the .npy format does not appear to have such a capability, you could nevertheless do similar yourself. E.g. multiply the S0 image by 0.001 prior to tcksample, and simply document that the values stored in that NPY correspond to (0.001 x S0). Do the same thing with MD with a 1000 multiplier. Just brings the stored values closer to the centre of the distribution (1.0) so that precision-limited numerical rounding is mitigated.

Thanks for this suggestion Rob, I'll add that in to make sure we're safe with MD and S0.

I already have written a script to convert the streamline endpoints to .npy format so I'll use that for now not to postpone batch job executions. Finally, I tried pulling the latest version of the ukb branch (containing the NPY format pushes). From what I understood, MRtrix (in both tcksample and tckstats) would handle this by checking file suffix. I tried running the script saving in a NPY format (filename: streamline_metric_FA_mean.npy), However, the updated MRtrix pull still saves a text file with a .npy extension. Do I need to explicitly add a flag for file format?

This is an example script I'm using:

tcksample -info -config NPYFloatMaxSavePrecision 16 -stat_tck mean "${tracks}" "${dmri_dir}/dti_FA.nii.gz" "${dmri_dir}/streamline_metric_FA_mean.npy"

Lestropie commented 2 years ago

Did you recompile after the pull? If it's not as simple as that I'll likely look at it tomorrow.

sina-mansour commented 2 years ago

Did you recompile after the pull? If it's not as simple as that I'll likely look at it tomorrow.

I did, I even tried running the configure script just in case (before running build). However, the output was still a text file with a .npy extension.

sina-mansour commented 2 years ago

Oops, my bad, I was using the old binaries.

Now tcksample is generating binaries as expected. However, tckstats -dump is still saving text files:

tckstats -info -config NPYFloatMaxSavePrecision 16 -dump "${streamline_length_npy}" "${tracks}"

sina-mansour commented 2 years ago

Following on the previous comment, can the NPY option be used in conjunction with other MRtrix commands?

Here are a few examples:

with tckstats -dump to store length as a binary NPY file
with tcksift2 to store SIFT2 weightings
with commands like mrmath to be able to multiply / divide values before converting to float16
and also to be read by tck2connectome to generate weighted tractograms from NPY weight files

sina-mansour commented 2 years ago

This latest commit uses a simple python script to convert and store binaries. I'm happy to replace them with MRtrix native commands once that's final.

Lestropie commented 2 years ago

tckstats -dump uses different code, because it just dumps the lengths in string form as they are calculated rather than building them into a dense vector representation and then exporting them using the backend library. I can change that.
tcksift2 should work already
By mrmath, I presume you mean being able to store image data with float16 datatype? Currently this isn't supported, but based on Donald's comment in MRtrix3/mrtrix3#2437 I'll probably try implementing that.
tck2connectome -scale_file should work already

Lestropie commented 2 years ago

How do you think the runtime looks for tcksample? We'd discussed previously that it could receive as input a 4D image, and it could export the e.g. mean value along each streamline for each 3D image volume; the output would then be a 2D matrix (which the .npy format support should handle). But that would then be slightly tricky to feed to tck2connectome; a separate script would need to extract a specific column from the .npy and save it as a 1D vector to then feed to tck2connectome.

sina-mansour commented 2 years ago

How do you think the runtime looks for tcksample? We'd discussed previously that it could receive as input a 4D image, and it could export the e.g. mean value along each streamline for each 3D image volume; the output would then be a 2D matrix (which the .npy format support should handle). But that would then be slightly tricky to feed to tck2connectome; a separate script would need to extract a specific column from the .npy and save it as a 1D vector to then feed to tck2connectome.

Currently, we're running tcksample 7 times, and every single time it takes around 10 seconds (on my PC) so it takes a total of ~70 seconds. I think this is not too slow, so we could even use it as-is for now (I'll update this once I run the same test on spartan in case the times vary).

Lestropie commented 2 years ago

tckstats -dump should now work with .npy as of https://github.com/MRtrix3/mrtrix3/commit/771d50bb0c00a8fdbaaffa7f60313b056008c207

sina-mansour commented 2 years ago

I was comparing the accuracy of the float16 NPY files generated from tcksample with the original text files, and I realized that the degree of difference is larger than expected:

Here a few example values taken from tha FA measure:

Original precision (text format): [0.16438006, 0.43450493, 0.36460525, 0.24658318, 0.44166291]

MRtrix float16 sampling: [0.1575, 0.4365, 0.318 , 0.2805, 0.4377]

However, if the resulting text file were to ba saved in NPY in python, this is the resulting values: [0.1644, 0.4346, 0.3645, 0.2466, 0.4417]

What I think might be happenning is that the sampling itself is done at the half precision. This could potentially increase innacuracies in mathematical computations along the streamline.

@Lestropie do you think that's what's causing the variations? If so, it would be possible to perform the sampling in high precision and only store the final metric in a reduced precision?

I could keep the script in its current format (converting text to NPY with python) for now to start batch job submissions sooner. But NPY handling would be a great additional feature for MRtrix in general.

sina-mansour commented 2 years ago

An update on the previous comment:

Both tckstats -dump & tcksift2 produced accurate NPY converted metric files, so the accuracy issue is only specific to tkcsample outputs.

sina-mansour commented 2 years ago

tck2connectome -scale_file should work already

This works as expected in mapping connectomes directly from NPY files.

But reading the NPY produces some strange print logs: (the Openers log generates some verbose output when reading NPY header)

tck2connectome: [INFO] opening image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/atlases/combinations/native.dMRI_space.aparc.a2009s+Tian_Subcortex_S1_3T.nii.gz"...
tck2connectome: [INFO] image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/atlases/combinations/native.dMRI_space.aparc.a2009s+Tian_Subcortex_S1_3T.nii.gz" opened with dimensions 256x256x256, voxel spacing 1x1x0.99999994039535522, datatype UInt32LE
tck2connectome: [100%] uncompressing image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/atlases/combinations/native.dMRI_space.aparc.a2009s+Tian_Subcortex_S1_3T.nii.gz"
Openers: []; prev_was_escape = 0; current: ; key: 
Openers: [']; prev_was_escape = 0; current: '; key: 
Openers: [']; prev_was_escape = 0; current: 'd; key: 
Openers: [']; prev_was_escape = 0; current: 'de; key: 
Openers: [']; prev_was_escape = 0; current: 'des; key: 
Openers: [']; prev_was_escape = 0; current: 'desc; key: 
Openers: [']; prev_was_escape = 0; current: 'descr; key: 
Openers: []; prev_was_escape = 0; current: 'descr'; key: 
Openers: []; prev_was_escape = 0; current: ; key: descr
Openers: []; prev_was_escape = 0; current: ; key: descr
Openers: [']; prev_was_escape = 0; current: '; key: descr
Openers: [']; prev_was_escape = 0; current: '<; key: descr
Openers: [']; prev_was_escape = 0; current: '<f; key: descr
Openers: [']; prev_was_escape = 0; current: '<f2; key: descr
Openers: []; prev_was_escape = 0; current: '<f2'; key: descr
Openers: []; prev_was_escape = 0; current: ; key: 
Openers: []; prev_was_escape = 0; current: ; key: 
Openers: [']; prev_was_escape = 0; current: '; key: 
Openers: [']; prev_was_escape = 0; current: 'f; key: 
Openers: [']; prev_was_escape = 0; current: 'fo; key: 
Openers: [']; prev_was_escape = 0; current: 'for; key: 
Openers: [']; prev_was_escape = 0; current: 'fort; key: 
Openers: [']; prev_was_escape = 0; current: 'fortr; key: 
Openers: [']; prev_was_escape = 0; current: 'fortra; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran_; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran_o; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran_or; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran_ord; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran_orde; key: 
Openers: [']; prev_was_escape = 0; current: 'fortran_order; key: 
Openers: []; prev_was_escape = 0; current: 'fortran_order'; key: 
Openers: []; prev_was_escape = 0; current: ; key: fortran_order
Openers: []; prev_was_escape = 0; current: ; key: fortran_order
Openers: []; prev_was_escape = 0; current: F; key: fortran_order
Openers: []; prev_was_escape = 0; current: Fa; key: fortran_order
Openers: []; prev_was_escape = 0; current: Fal; key: fortran_order
Openers: []; prev_was_escape = 0; current: Fals; key: fortran_order
Openers: []; prev_was_escape = 0; current: False; key: fortran_order
Openers: []; prev_was_escape = 0; current: ; key: 
Openers: []; prev_was_escape = 0; current: ; key: 
Openers: [']; prev_was_escape = 0; current: '; key: 
Openers: [']; prev_was_escape = 0; current: 's; key: 
Openers: [']; prev_was_escape = 0; current: 'sh; key: 
Openers: [']; prev_was_escape = 0; current: 'sha; key: 
Openers: [']; prev_was_escape = 0; current: 'shap; key: 
Openers: [']; prev_was_escape = 0; current: 'shape; key: 
Openers: []; prev_was_escape = 0; current: 'shape'; key: 
Openers: []; prev_was_escape = 0; current: ; key: shape
Openers: []; prev_was_escape = 0; current: ; key: shape
Openers: [(]; prev_was_escape = 0; current: (; key: shape
Openers: [(]; prev_was_escape = 0; current: (2; key: shape
Openers: [(]; prev_was_escape = 0; current: (21; key: shape
Openers: [(]; prev_was_escape = 0; current: (219; key: shape
Openers: [(]; prev_was_escape = 0; current: (2190; key: shape
Openers: [(]; prev_was_escape = 0; current: (21906; key: shape
Openers: [(]; prev_was_escape = 0; current: (219066; key: shape
Openers: [(]; prev_was_escape = 0; current: (2190667; key: shape
Openers: [(]; prev_was_escape = 0; current: (2190667,; key: shape
Openers: []; prev_was_escape = 0; current: (2190667,); key: shape
Openers: []; prev_was_escape = 0; current: ; key: 
Final keyvalues: { 'descr': '<f2',  'fortran_order': 'False',  'shape': '(2190667,)', }
tck2connectome: [100%] Constructing connectome

Lestropie commented 2 years ago

Both tckstats -dump & tcksift2 produced accurate NPY converted metric files, so the accuracy issue is only specific to tkcsample outputs.

OK that's useful, I don't quite know what's going on with tcksample there but I can look into it.

But reading the NPY produces some strange print logs

That will be the result of me cherry-picking things I'm implementing on dedicated code development branches onto the ukb branch for you to use, and failing to propagate everything relevant. Should be reasonably easy to fix.

Lestropie commented 2 years ago

https://github.com/MRtrix3/mrtrix3/commit/b8ab21043d5dc79913d056450542cdaa69840868 should disable the NPY parsing debug outputs. Looking into tcksample next.

Lestropie commented 2 years ago

I can't reproduce the issue where tcksample outputs written with 16-bit precision are numerically different to 32-bit precision and ASCII text. You'll need to experiment a bit at your end to figure out where the discrepancy is coming from. The 16-bit conversion only occurs at the point of file write, so it shouldn't be an issue of sampling / calculating at 16-bit precision. Indeed, while I've generalised support for 16-bit floating-point numbers in https://github.com/MRtrix3/mrtrix3/pull/2437, I've not cherry-picked that into the ukb branch, so there's no way that such calculations could be happening.

Lestropie commented 2 years ago

PS. The magnitude of the difference looks like about what you'd expect when changing between use and non-use of the -precise option; maybe it's a mistake of provenance?

sina-mansour commented 2 years ago

PS. The magnitude of the difference looks like about what you'd expect when changing between use and non-use of the -precise option; maybe it's a mistake of provenance?

Thanks for this pointer Rob, that was an issue on my end comparing the differently sampled approaches. Sorry for the inconvenience. This is now fixed from my side and works as expected.

sina-mansour commented 2 years ago

Hi all,

I just wanted to post this report of the first complete run as a batch process on Spartan:

I requested a single core with 16G memory. The complete structural connectivity mapping process took nearly 3 hours. Here's a breakdown of how long every step took:

Downloading bulk files from UKB and extracting zips: ~5 minutes
Initial image preparations: a few seconds
Response function estimation: ~1 minute
MSMT CSD FOD estimation: ~16 minutes
Running 5ttgen: ~2 minutes
Probabilistic tractography (10 million seeds): ~ 2 hours and 10 minutes
Sift2 (without any downsampling, see #24 ): ~20 minutes
Metric sampling per streamline: ~4 minutes
mapping connectomes: ~ 3 minutes

As it can be seen the main execution time is spent on the probabilistic tractography.

Here, I also wanted to report the file sizes generated:

A total of 100MB of data is stored per subject which breaks down to:

Native DWI atlases (cortical & subcortical separately, not the combined ones, see #20 ): ~2MB
Connectivity matrices (4 metrics, 7 atlases): ~29MB
Endpoints in NPY half-precision format: ~28MB
Per-streamline metrics (to enable mapping connectomes on other atlases): ~42MB

Additionally, this was the utilization of the requested resources:

$ seff 34468107
Job ID: 34468107
Cluster: spartan
User/Group: sina_mansour/punim1566
State: COMPLETED (exit code 0)
Cores: 1
CPU Utilized: 02:56:00
CPU Efficiency: 96.90% of 03:01:38 core-walltime
Job Wall-clock time: 03:01:38
Memory Utilized: 1.21 GB
Memory Efficiency: 7.59% of 16.00 GB

To me, it was really strange that the memory utilization was so low. I had imagined that at least tcksift2 would require considerably more memory. Nevertheless, this indicates that we could potentially request for less resources (maybe only 4GB of memory is more than enough, let me know what you think?)

Finally, every job generates around 6GB of data on the scratch space (excluding the intermediary files generated by 5ttgen or other mrtrix commands). That gives us a moderate degree of parallelization. Neglecting the intermediary files, we could approximately have at most 300 jobs running in parallel without running into scratch quota issues. (we have a total of 2TB scratch space). Nevertheless, there were days where spartan gave us as many as ~500 jobs in parallel, which if happens, would cause scratch quota issues.

@caioseguin maybe it's worth requesting more scratch quota for the project, i.e. up to 5TB?

I'll next run the first 100 subjects, and hopefully, we could soon have a look at their connectomes (I'll reduce the memory resource request to 8GB for now).

Lestropie commented 2 years ago

A total of 100MB of data is stored per subject

Do the parcellations in native space need to be stored? If UKB already provides the non-linear warp between template and subject, then this is something that could be integrated into the script responsible for generating the connectome matrices, would not take long to execute in an on-demand basis, and so there's little benefit in UKB storing those data.
Is quoted connectivity matrix size based on tck2connectome output? If so, what format / precision? Even then, could still be halved by writing to a 1D vector that can then be remapped to a 2D symmetric matrix. But kind of circumvents the purpose of these particular matrices, being ease of access compared to having to execute something to generate.
It would not surprise me if UKB were to express a preference to not store any connectivity matrices, given that we need to provide a tool to do that anyway in order to support a comprehensive range of parcellations, and omitting such from the submitted data would reduce storage by 30%. The question is the extent to which we can streamline the utilisation of that tool.

I had imagined that at least tcksift2 would require considerably more memory.

A lot of effort was invested in minimising RAM usage here... check out this hack.

(maybe only 4GB of memory is more than enough, let me know what you think?)

Should be safe. I wouldn't expect a more than doubling of RAM from one subject to another.

Finally, every job generates around 6GB of data on the scratch space (excluding the intermediary files generated by 5ttgen or other mrtrix commands).

Does SPARTAN provide the capability to use a RAM filesystem on executed jobs? If so, it may be preferable for data to go there rather than the HDD scratch, especially if you can request enough RAM to cover the whole job. May give a little execution speedup as well.

Nevertheless, there were days where spartan gave us as many as ~500 jobs in parallel, which if happens, would cause scratch quota issues.

Was this requesting 500 jobs with the same RAM requirements and wall time? Otherwise it's not a valid metric, since this goes into the fair use calculations that determine what gets executed when.

I'll next run the first 100 subjects

If you're wanting to run subjects data with the intention of keeping those data for UKB submission, I'd like to read through the script as a whole again and try hard to come up with anything being overlooked. Also should consider whether any other data should be saved; e.g. may not want to discard the revised DWI-T1w transformation.

sina-mansour commented 2 years ago

Do the parcellations in native space need to be stored? If UKB already provides the non-linear warp between template and subject, then this is something that could be integrated into the script responsible for generating the connectome matrices, would not take long to execute in an on-demand basis, and so there's little benefit in UKB storing those data.

The generation of native parcellation was not a direct nonlinear warp of the atlases from MNI. A more accurate native atlas was generated by using the freesurfer derived native surface meshes which is called from this script. Nevertheless, as we are providing all atlases in the native T1 space, I think we could possibly save on space by providing the rigid transformation from T1 to DWI space. The only downside of that would be that while we are saving storage space, the user would need to download more files (The list of native atlases as one compressed file + the diffusion tractography outputs as a separate compressed file.). But I personally think that this wouldn't be a bad compromise to save storage space.

Is quoted connectivity matrix size based on tck2connectome output? If so, what format / precision? Even then, could still be halved by writing to a 1D vector that can then be remapped to a 2D symmetric matrix. But kind of circumvents the purpose of these particular matrices, being ease of access compared to having to execute something to generate.

Yes, it's a simple CSV output of tck2connectome. As you mentioned this could potentially be further reduced, but I too think that since these files are tailored to users who may not want further conversions to use the data the 1-dimensional vectorization would not be appropriate. But I don't really have a strong opinion on this.

It would not surprise me if UKB were to express a preference to not store any connectivity matrices, given that we need to provide a tool to do that anyway in order to support a comprehensive range of parcellations, and omitting such from the submitted data would reduce storage by 30%. The question is the extent to which we can streamline the utilisation of that tool.

In fact, the total amount of files that we are planning to upload to UKB storage also includes resting-state Functional activity and native atlases, which sums to nearly 300MB per individual. Hence, not providing the matrices would only reduce storage by ~10%. However, I think these matrices are most likely what many individuals are interested to have. I think many would rather not use an extra tool to compute the matrices from the stored endpoints and metrics. Hence, these select connectivity matrices were provided to increase the ease of access.

Does SPARTAN provide the capability to use a RAM filesystem on executed jobs? If so, it may be preferable for data to go there rather than the HDD scratch, especially if you can request enough RAM to cover the whole job. May give a little execution speedup as well.

I'm not aware if spartan provides a RAMDISK on the computing nodes (it's not mentioned anywhere here). I'll ask the HPC admins and see if that's possible.

Was this requesting 500 jobs with the same RAM requirements and wall time? Otherwise it's not a valid metric, since this goes into the fair use calculations that determine what gets executed when.

That's right, the previous jobs had different requirements (if I remember correctly, similar wall time, but higher RAM requirements). That's hence just a crude estimate mainly based on the number of cores allocated at any time.

If you're wanting to run subjects data with the intention of keeping those data for UKB submission, I'd like to read through the script as a whole again and try hard to come up with anything being overlooked. Also should consider whether any other data should be saved; e.g. may not want to discard the revised DWI-T1w transformation.

The first 100 are for final verification of the pipeline before running batches for the whole population, so we could potentially delete the generated files and re-execute the scripts again (it wouldn't take that long anyways). Please feel free to review the current scripts (tractography, and connectome mapping) for any final comments before batch execution for the whole ~50,000 instances.

P.S. the first 100 jobs have finished and I'll share some of the connectomes in an email soon.

Lestropie commented 2 years ago

The generation of native parcellation was not a direct nonlinear warp of the atlases from MNI. A more accurate native atlas was generated by using the freesurfer derived native surface meshes which is called from this script.

Confirm that this is the case for all parcellations; ie. there are no parcellations that are based on a volumetric transformation?

Nevertheless, it's a question of storage size / download size / computational complexity between pre-calculating and uploading such data vs. providing a tool that performs such locally on demand. If I were UKB, I'd only want to be storing data that have some requisite amount of computational complexity in their generation; anything that can be trivially computed from data already stored should be done on demand. But I am entirely hypothesizing / projecting here; given the scope of the project you may need to communicate with them in more detail as to how they wish for these things to be balanced.

Yes, it's a simple CSV output of tck2connectome. As you mentioned this could potentially be further reduced, but I too think that since these files are tailored to users who may not want further conversions to use the data the 1-dimensional vectorization would not be appropriate. But I don't really have a strong opinion on this.

Agree the 1D vectorization becomes clumsy; but you could nevertheless reduce the storage space by saving the tck2connectome output by writing to NPY rather than e.g. CSV. You may not be able to use 16-bit precision in all cases, but 32-bit should still yield a reduction.

the previous jobs had different requirements

All good; mostly wanted to curb your expectations in case you had done a naive parallelization check that would then lead to disappointment when running the actual jobs.

Please feel free to review the current scripts

At the very least, should strip out all of the inline comments that I added to the file itself in the first round of feedback.

sina-mansour commented 2 years ago

Confirm that this is the case for all parcellations; ie. there are no parcellations that are based on a volumetric transformation?

All cortical parcellations are generated with this approach (but not the subcortex)

Agree the 1D vectorization becomes clumsy; but you could nevertheless reduce the storage space by saving the tck2connectome output by writing to NPY rather than e.g. CSV. You may not be able to use 16-bit precision in all cases, but 32-bit should still yield a reduction.

I'll try storing the connectomes in NPY format and change to that if it results in a considerable size reduction.

At the very least, should strip out all of the inline comments that I added to the file itself in the first round of feedback.

Sure ✅

sina-mansour commented 2 years ago

I'll try storing the connectomes in NPY format and change to that if it results in a considerable size reduction.

This didn't change the storage requirements (a single-precision floating-point NPY takes nearly the same amount of storage as a CSV file). I will revert back to using CSV to keep a more familiar connectome format.

Lestropie commented 2 years ago

I will revert back to using CSV to keep a more familiar connectome format.

What's the number of significant figures in the CSVs? If sticking to a string format this can nevertheless be modulated to similar ends. E.g. tck2connectome probably writes double-precision to string, which would be 15 significant figures; complete overkill.

sina-mansour commented 2 years ago

What's the number of significant figures in the CSVs? If sticking to a string format this can nevertheless be modulated to similar ends. E.g. tck2connectome probably writes double-precision to string, which would be 15 significant figures; complete overkill.

There are actually only 6 significant digits in the CSV files, I think this might be a result of saving the metrics in a half-precision format.

Lestropie commented 2 years ago

6 digits is default for single-precision (ie. 32-bit) floating-point. Half-precision appears to default to 4.

sina-mansour commented 2 years ago

As per prior discussions, the pipeline is now finalized to speed up execution duration while maximizing the mapped tractogram quality with the stat-of-the-art tractography techniques.