sina-mansour / UKB-connectomics

This repository will host scripts used to map structural and functional brain connectivity matrices for the UK Biobank dataset.
https://www.biorxiv.org/content/10.1101/2023.03.10.532036v1

MSMT CSD runtime #14

Closed sina-mansour closed 2 years ago

sina-mansour commented 2 years ago

In the current implementation, dwi2fod msmt_csd uses all 8 threads and takes ~12 mins. I just noticed that I could set it to use only 1 thread (-nthreads 0); however, I suspect that this would drastically increase runtime. I'll test that and update you on how long it takes on a single thread.

Do you have any ideas as to what we could do to make it potentially faster without detrimental effects?
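For anyone wanting to reproduce the comparison, here is a minimal sketch of how the two invocations could be assembled. The file names are hypothetical placeholders rather than the paths used in this repository; `-nthreads` is the standard MRtrix3 option, where 0 disables multi-threading.

```python
import shlex


def build_dwi2fod_cmd(nthreads=None):
    """Build an argv list for `dwi2fod msmt_csd`.

    All file names below are hypothetical placeholders.
    """
    cmd = [
        "dwi2fod", "msmt_csd", "dwi.mif",
        "wm_response.txt", "wm_fod.mif",
        "gm_response.txt", "gm.mif",
        "csf_response.txt", "csf.mif",
        "-mask", "mask.mif",
    ]
    if nthreads is not None:
        # MRtrix3 convention: 0 disables multi-threading.
        cmd += ["-nthreads", str(nthreads)]
    return cmd


single_thread = build_dwi2fod_cmd(nthreads=0)
eight_threads = build_dwi2fod_cmd(nthreads=8)

# Print shell-ready commands; prefix each with `time` to compare runtimes.
print(shlex.join(single_thread))
print(shlex.join(eight_threads))
```

The sketch only prints the commands, so it can be checked without MRtrix3 installed.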

sina-mansour commented 2 years ago

Update: the time taken for dwi2fod did indeed increase, from ~12 mins to ~33 mins.

Given that this would reduce the number of streamlines we can estimate within a fixed time budget (a ballpark figure of ~2 hours), we may want to optimize/speed up this step.

@Lestropie what are your thoughts?

caioseguin commented 2 years ago

Very naive question: does run time decrease linearly with the number of threads in this application? Or does the benefit of adding more threads plateau at some point? If it does plateau and having 2 (or 3, or 4) threads offers a big speed-up in comparison to just 1, we could consider asking for slightly more resources when we submit the jobs, without having to go all the way to 8 threads. Would trying something like this make any sense?
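As a rough way to frame the question, the two timings reported above (~33 mins on 1 thread, ~12 mins on 8 threads) can be plugged into Amdahl's law to guess at intermediate thread counts. This is only a sketch: the Amdahl model is an assumption, it is fitted to two approximate measurements, and it ignores whether the 8 threads ran on physical or hyper-threaded cores.

```python
def serial_fraction(t1, tn, n):
    """Solve tn / t1 = s + (1 - s) / n for the serial fraction s (Amdahl's law)."""
    ratio = tn / t1
    return (ratio - 1.0 / n) / (1.0 - 1.0 / n)


def predicted_runtime(t1, s, n):
    """Runtime on n threads under Amdahl's law with serial fraction s."""
    return t1 * (s + (1.0 - s) / n)


T1, T8 = 33.0, 12.0  # minutes, as reported in this thread
s = serial_fraction(T1, T8, 8)

for n in (1, 2, 4, 8):
    t = predicted_runtime(T1, s, n)
    print(f"{n} threads: ~{t:.0f} min, speed-up {T1 / t:.2f}x")
```

Under these assumptions the model gives roughly 21 mins on 2 threads and 15 mins on 4, so most of the gain would arrive early. Actual timings at 2 and 4 threads would be needed to confirm that.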

Lestropie commented 2 years ago

> @Lestropie what are your thoughts?

The only ways to reduce the number of CPU cycles per voxel are:

  1. To reduce the lmax of the WM ODF;

  2. To reduce the number of directions along which non-negativity is enforced.

lmax=8 has been the precedent since ~2007, so it's difficult to get away from. Decreasing it would make FODs less sharp, which would make streamlines disperse more (all other parameters being equal).

Reducing the number of directions (option 2) would be detrimental to the conditioning of the problem. It's actually already possible for there to be directions in which the WM FOD is negative, though with tiny amplitude, simply because this set of evaluation directions is not infinite. I'd be a little nervous about decreasing this.
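To give a feel for what lowering lmax changes, the sketch below counts the spherical harmonic coefficients per voxel for a given lmax, assuming the usual even-order-only basis used for FODs. It only illustrates the size of the unknown vector; how runtime scales with that size is not established in this thread.

```python
def n_sh_coeffs(lmax):
    """Number of even-order SH coefficients up to and including lmax."""
    if lmax < 0 or lmax % 2:
        raise ValueError("lmax must be a non-negative even integer")
    return (lmax + 1) * (lmax + 2) // 2


for lmax in (8, 6, 4):
    print(f"lmax={lmax}: {n_sh_coeffs(lmax)} coefficients per voxel")
```

Going from lmax=8 to lmax=6 drops the WM ODF from 45 to 28 coefficients, which is also why the resulting FODs are less sharp.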

> does run time decrease linearly with number of threads in this application?

I'd have expected it to be close to reciprocal with the number of cores. It multi-threads very effectively, with minimal cross-talk or need for mutexing (there's a bit more than there would technically need to be simply because of the way our multi-threading back-end works, but still minimal overhead). However, unlike tractography, I would not expect it to benefit a great deal from hyper-threading, since there is minimal downtime waiting for memory access during which another thread could be executing.

> If it does plateau and having 2 (or 3, or 4) threads offers a big speed-up in comparison to just 1, we could consider asking for slightly more resources when we submit the jobs, without having to go all the way to 8 threads.

This is a question of how you quantify resource use. For the HPC sysadmins, there would be minimal to no difference between your requesting N jobs each with 1 thread and 4 hours walltime, and requesting N jobs each with 2 threads and 2 hours walltime; neither requires node exclusivity, and the total number of CPU hours is the same. The only potential difference is whether requesting 2 threads results in 2 cores or 1 core with hyper-threading: the latter yields a ~50% improvement in tractography, but I doubt it would change much for CSD.
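The accounting argument can be written out as a small sketch. The job count is a hypothetical placeholder, and the equality only holds under the assumption that doubling the threads exactly halves the walltime.

```python
def cpu_hours(n_jobs, threads_per_job, walltime_hours):
    """Total CPU hours requested across all jobs."""
    return n_jobs * threads_per_job * walltime_hours


N_JOBS = 1000  # hypothetical number of subjects

one_thread = cpu_hours(N_JOBS, threads_per_job=1, walltime_hours=4)
two_threads = cpu_hours(N_JOBS, threads_per_job=2, walltime_hours=2)

print(one_thread, two_threads, one_thread == two_threads)
```

If scaling is worse than ideal, the multi-threaded request costs more CPU hours for the same work, which is what tips the balance toward single-threaded jobs.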


I would also note that if the computational cost of CSD is higher than expected, it's a stronger motivation for performing CSD in a way that would broaden its applicability (#3) and then distribute to the community (#11).

Lestropie commented 2 years ago

Also #4 will be increasing the total number of voxels, which will proportionally increase the runtime. If it's getting problematic, that safety measure could be removed.

sina-mansour commented 2 years ago

The current version of the CSD will run on a single core. Any lost time on a single job is expected to be gained back by a relative increase in the number of parallel jobs executed at any time.
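A back-of-the-envelope sketch of that trade-off, using the timings reported earlier in this thread. The 8-core node is a hypothetical assumption, as is the idea that per-job runtime stays the same when the node is fully packed (no memory or I/O contention).

```python
def subjects_per_hour(cores, threads_per_job, minutes_per_job):
    """CSD throughput of one node when packed with identical jobs."""
    concurrent_jobs = cores // threads_per_job
    return concurrent_jobs * 60.0 / minutes_per_job


CORES = 8  # hypothetical node size

single = subjects_per_hour(CORES, threads_per_job=1, minutes_per_job=33.0)
multi = subjects_per_hour(CORES, threads_per_job=8, minutes_per_job=12.0)

print(f"1 thread/job:  {single:.1f} subjects/hour")
print(f"8 threads/job: {multi:.1f} subjects/hour")
print(f"ratio: {single / multi:.2f}x")
```

Under these assumptions, eight single-threaded jobs side by side process roughly three times as many subjects per hour as one 8-thread job at a time, even though each individual job takes longer.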