sina-mansour / UKB-connectomics

This repository will host scripts used to map structural and functional brain connectivity matrices for the UK biobank dataset.
https://www.biorxiv.org/content/10.1101/2023.03.10.532036v1

SIFT2 weights #24

Closed sina-mansour closed 2 years ago

sina-mansour commented 2 years ago

I was checking the per streamline metrics when I observed a strange output for the streamline weights from SIFT2:

After tckgen, nearly 2 million streamlines were mapped (2190667 to be exact). I was checking the distribution of weights that SIFT2 generates, and I noticed that all streamlines had an equal weight of 1.

@Lestropie why do you think this might have happened? Is it a sign of undersampling? I'm simply using tcksift2 with the default functionality (no extra options). Do you reckon this is normal? Do you recommend altering any options?

The more important question is whether this is only specific to this subject, or whether it would happen across all subjects.
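For anyone wanting to reproduce this check, here's a minimal sketch (the file handling is simulated here; tcksift2 writes the per-streamline weights as a single whitespace-separated line of text):

```python
import io
import numpy as np

# Simulated tcksift2 weights output: one whitespace-separated line of
# per-streamline weights (here, the suspicious all-ones pattern observed).
weights_file = io.StringIO(" ".join(["1.0"] * 10))
weights = np.loadtxt(weights_file)

# A healthy SIFT2 run should yield a spread of weights, not a constant.
all_equal = bool(np.allclose(weights, weights[0]))
print(all_equal)
```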

Lestropie commented 2 years ago

Not something that I've ever observed. Did the terminal output indicate that some number of iterations were executed? You can also specify the -out_csv option and see what that says. If no clues from that I'd need to see the raw data.
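For reference, that invocation might look something like this (file names are placeholders; the `-csv` spelling is the one that worked later in this thread):

```shell
# Placeholder file names; -csv writes per-iteration optimisation statistics
tcksift2 tracks.tck wmfod_norm.mif sift2_weights.txt -csv sift2_stats.csv
```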

sina-mansour commented 2 years ago

Here's what the tcksift2 outputs:

tcksift2: [INFO] opening image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/wmfod_norm.mif"...
tcksift2: [INFO] image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/wmfod_norm.mif" opened with dimensions 104x104x72x45, voxel spacing 2.0192299999999999x2.0192299999999999x2x1, datatype Float32LE
tcksift2: [100%] Creating homogeneous processing mask
tcksift2: [100%] segmenting FODs
tcksift2: [100%] mapping tracks to image
tcksift2: [INFO] Proportionality coefficient after streamline mapping is 0.00098915378859790399
tcksift2: [INFO] 131735 fixels have no attributed streamlines; these account for 7.8545167441684862% of the initial cost function
tcksift2: [INFO] 191018 of 322753 fixels were tracked, but have been excluded from optimisation due to inadequate reconstruction;
tcksift2: [INFO] these contribute 92.145483255831081% of the initial cost function
tcksift2: [INFO] Constant A scaling regularisation terms to match data term is 0.016847761147202777
tcksift2:   Iteration     CF (data)      CF (reg)     Streamlines
tcksift2: [done]        10        100.000%         0.000%        2190647
tcksift2: [done]        10        100.000%         0.000%        2190647

After adding the -csv this is the stat saved:

Iteration Cost_data Cost_reg_tik Cost_reg_tv Cost_reg Cost_total Streamlines Fixels_excluded Step_min Step_mean Step_mean_abs Step_var Step_max Coeff_min Coeff_mean Coeff_mean_abs Coeff_var Coeff_max Coeff_norm
0 46513.6 0 0 0 46513.6 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
1 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
2 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
3 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
4 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
5 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
6 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
7 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
8 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
9 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
10 46513.61547 0 0 0 46513.61547 2190647 191018 0 0 0 0 0 0 0 0 0 0 0
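As a quick sanity check, the zeroed step/coefficient columns in a CSV like the one above can be detected programmatically; a minimal sketch using a trimmed-down stand-in for the real file:

```python
# Trimmed-down stand-in for the tcksift2 -csv output shown above
csv_text = """\
Iteration Cost_data Step_mean Coeff_mean
0 46513.6 0 0
1 46513.61547 0 0
2 46513.61547 0 0
"""

rows = [line.split() for line in csv_text.strip().splitlines()]
header, data = rows[0], rows[1:]
step_col = header.index("Step_mean")

# If every step is zero, the optimiser never updated any streamline weight
stalled = all(float(row[step_col]) == 0.0 for row in data)
print(stalled)  # True for the run shown above
```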

Please let me know if these can tell what's causing the problem.

In the meantime, I am preparing example output files from a single scan to share (all files related to tractography and connectivity mapping will be included). I'll send an email with a link to access the data soon; perhaps that could also help identify what's causing the problem.

Lestropie commented 2 years ago

This line is not uncommon, though the proportion may be a bit higher than typical:

tcksift2: [INFO] 131735 fixels have no attributed streamlines; these account for 7.8545167441684862% of the initial cost function

This one is higher than expected:

tcksift2: [INFO] 191018 of 322753 fixels were tracked, but have been excluded from optimisation due to inadequate reconstruction;

That gets triggered by this line. I'm not sure what it is about your data that's causing the numbers to look like that, though admittedly I've not looked at these values for a number of years now. Getting my hands on some raw data will indeed be necessary here.

sina-mansour commented 2 years ago

Update: the flag -nthreads 0 was the cause of this issue.

Here are the outputs after removing the threading flag:

tcksift2: [INFO] opening image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/wmfod_norm.mif"...
tcksift2: [INFO] image "/home/sina/Documents/Research/Datasets/UK_biobank/sample/1000243_2_0/dMRI/dMRI/wmfod_norm.mif" opened with dimensions 104x104x72x45, voxel spacing 2.0192299999999999x2.0192299999999999x2x1, datatype Float32LE
tcksift2: [100%] Creating homogeneous processing mask
tcksift2: [100%] segmenting FODs
tcksift2: [100%] mapping tracks to image
tcksift2: [INFO] Proportionality coefficient after streamline mapping is 0.00098915378859788078
tcksift2: [INFO] 131735 fixels have no attributed streamlines; these account for 7.854516744168988% of the initial cost function
tcksift2: [INFO] 49149 of 322753 fixels were tracked, but have been excluded from optimisation due to inadequate reconstruction;
tcksift2: [INFO] these contribute 3.7948157400504323% of the initial cost function
tcksift2: [INFO] Constant A scaling regularisation terms to match data term is 0.016847761147202763
tcksift2:   Iteration     CF (data)      CF (reg)     Streamlines
tcksift2: [done]        50        16.903%         1.372%        2190647
tcksift2: [done]        50        16.903%         1.372%        2190647
Lestropie commented 2 years ago

Okay, that's ... concerning. I'll have to dig into it.

Lestropie commented 2 years ago

PS. Using -nthreads 1 would be preferable to omitting the option entirely if jobs are going to be given access to only a single CPU core, since it will reduce unnecessary thread switching, and hopefully it will still circumvent the issue (-nthreads 0 changes the job handling code quite a lot, literally not spawning any threads, whereas -nthreads 1 will limit the number of threads spawned for the sole purpose of the bulk processing part to just 1, so ultimate performance is comparable).

sina-mansour commented 2 years ago

> PS. Using -nthreads 1 would be preferable to omitting the option entirely if jobs are going to be given access to only a single CPU core, since it will reduce unnecessary thread switching, and hopefully it will still circumvent the issue (-nthreads 0 changes the job handling code quite a lot, literally not spawning any threads, whereas -nthreads 1 will limit the number of threads spawned for the sole purpose of the bulk processing part to just 1, so ultimate performance is comparable).

I'm wondering whether omitting the -nthreads option from Spartan jobs would be an even better option. I'm not sure how Spartan handles threading (i.e. whether a single-CPU job would be given only a single thread or more). From my understanding, if -nthreads is not explicitly assigned, MRtrix would choose an optimal value; for instance, on my PC it uses all 8 available threads. Do you think omitting the threading flag from Spartan batch jobs would be a better choice?

Lestropie commented 2 years ago

So my prior point may not have been entirely accurate. I'm not actually familiar with how it is that HPC systems limit access to hardware resources based on the resource request as outlined in the SLURM script. But what I would expect to happen is that, if you execute a SLURM script that requests a single CPU core (and there's no multi-threading), the internal MRtrix3 check should report that there is only one thread available.

The behaviour of an internal multi-threaded step will then depend on the architecture of that particular multi-threading job. In the most general case, there might be:

- a "source" thread that generates or reads each datum;
- one or more "pipe" threads that perform the bulk of the processing;
- a "sink" thread that collects or writes out the results.

In many cases, the source and sink threads mostly sit idle, and the pipe threads take all of the time on the CPU(s).

Note here that if you specify -nthreads 1, or if the hardware check reports 1 thread available, there may nevertheless be as many as 3 threads running; the thread count only restricts the number of threads for the pipe.

The exception to this case is when specifically -nthreads 0 is used. Here, the multi-thread-safe queues are not even constructed. No new threads are spawned, and the single thread does the job of source, pipe and sink sequentially for each datum. So in the situation where there really is only one hardware thread available, -nthreads 0 may be the best option available, as -nthreads 1 leads to the need to spawn new threads, switch between threads, and do thread-locking, whereas -nthreads 0 does not.
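A conceptual sketch of the -nthreads 0 behaviour described above (illustrative Python, not MRtrix3 code): the single thread simply performs source, pipe and sink in sequence for each datum, with no queues or locking.

```python
def source():
    # Yield each datum to be processed (stand-in for e.g. streamline loading)
    yield from range(5)

def pipe(datum):
    # Bulk processing step (stand-in for the actual per-datum computation)
    return datum * datum

results = []
for datum in source():           # single thread: no queues, no thread-locking
    results.append(pipe(datum))  # "sink": collect the processed datum
print(results)  # [0, 1, 4, 9, 16]
```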

Yes, if -nthreads is not explicitly used, MRtrix3 will allocate as many threads as are reported as being available in hardware. But as discussed previously, from an HPC resource allocation perspective, running N jobs with 8 cores for 1 hour each consumes 8 times the resources of running N jobs with 1 core for 1 hour each. We had previously basically committed to running each job single-threaded. I highly doubt that asking e.g. Spartan to run a job flagged as wanting only 1 core will result in MRtrix3 spawning e.g. 8 threads; and even if it did, it would make the system administrators Very Angry. So either we commit to running every job multi-threaded, which requires setting the resource request in the SLURM script and budgeting accordingly, or we commit to running every job single-threaded, which also requires setting that request in the SLURM script, but in which case I would additionally advise using -nthreads 0 for all MRtrix3 commands.
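For the single-threaded option, the SLURM script would look roughly like this (walltime and file names are placeholders):

```shell
#!/bin/bash
#SBATCH --cpus-per-task=1     # request a single CPU core
#SBATCH --time=01:00:00       # placeholder walltime

# Match the resource request: spawn no threads at all
tcksift2 tracks.tck wmfod_norm.mif sift2_weights.txt -nthreads 0
```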

Lestropie commented 2 years ago

tcksift2 -nthreads 0 should now be fixed on ukb branch: https://github.com/MRtrix3/mrtrix3/commit/025e5afefb6c675aaa4d19c694a750305a193899

sina-mansour commented 2 years ago

> tcksift2 -nthreads 0 should now be fixed on ukb branch: MRtrix3/mrtrix3@025e5af

Thanks Rob! It's now working. I just had a question regarding the single-threaded runtime: it takes around 14 minutes on my PC to execute (~2M streamlines). It takes around 5 minutes with 8 threads, so I'm guessing 14 minutes on a single thread is expected, but I wanted to make sure.
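For what it's worth, a back-of-envelope check of the parallel scaling implied by those two timings:

```python
t_single, t_eight = 14.0, 5.0   # minutes, as reported above
speedup = t_single / t_eight
efficiency = speedup / 8        # fraction of an ideal 8x scaling
print(f"{speedup:.2f}x speedup, {efficiency:.0%} parallel efficiency")
```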

Also, do you think there are ways to potentially speed this up (by changing the default parameters) without loss of accuracy/quality? If not, I guess we just need to run the whole pipeline on the computing clusters and reduce the seed count if the overall execution time was deemed too high.

Lestropie commented 2 years ago

Yeah, that's probably about what I'd expect.

It would theoretically be a little faster if I could implement a better non-linear solver; part of what contributes to the runtime is the slow convergence rate at later iterations, which necessitates a larger number of iterations. But I've invested quite a lot of time in the past trying to do this and never had success, so it's unlikely to happen. If you have any ideas for speeding up a ~500k x 2M non-linear sparse matrix solver, I'm all ears.

One way to decrease runtime quite a bit is to use Algorithm 3 rather than Algorithm 4 as described here. The cost in FOD segmentation and streamline mapping is identical, but the optimisation from that point is not iterative like the SIFT2 implementation; it only requires 2 passes over the data. But as I show in that manuscript, that approach does not provide the same guarantees that the SIFT2 algorithm does, and so the quantitative interpretation of the resulting connectivity is not as robust. So my personal preference would be to stick to SIFT2, reducing the streamline count by a small amount if need be.

sina-mansour commented 2 years ago

I just came across this thread. It suggests that the FOD image can be down-sampled to reduce the computational burden, but I'm not sure if this is still an option, or whether it could be too detrimental for the outputs.
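If that route were taken, the downsampling step itself could presumably be done with MRtrix3's mrgrid (file names are placeholders):

```shell
# Regrid the FOD image to half the original resolution (placeholder names)
mrgrid wmfod_norm.mif regrid -scale 0.5 wmfod_lowres.mif
```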

Lestropie commented 2 years ago

I suppose it's technically an option given the computational constraints. There's a few different ways to compute the difference it makes; what's harder is to contextualise the magnitude of that difference...

sina-mansour commented 2 years ago

> I just came across this thread. It suggests that the FOD image can be down-sampled to reduce the computational burden, but I'm not sure if this is still an option, or whether it could be too detrimental for the outputs.

So I've tried this idea of downsampling the FODs and here's what happens:

In terms of runtime, we get a 3x speedup after downsampling the FODs to 50% of the original resolution (from 15 minutes to 5 minutes). That's a considerable speedup (saving about 500,000 computing hours across all individual scans); however, what's difficult to evaluate is the degree of impact of this decision on the quality of FBC.

Here I've evaluated some basic stats and info about the differences:

These are the SIFT values for the first few streamlines:

1.289 , 0.4778, 0.843 , 1.477 , 0.8706, 0.4392, 0.2947, 0.439

And these are the same values after downsampling:

2.092 , 0.7524, 0.974 , 1.15 , 1.083 , 0.6094, 0.366 , 0.8125

These values change considerably, but the order is more or less similar. I computed the following stats over all 2M or so streamlines to get a better picture:

Correlation (Pearson's r): 0.828 Mean Absolute Error: 0.233

When mapping atlas level connectomes (for the SIFT2 metric) we get:

Correlation (Pearson's r): 0.9984 (for an atlas of 164 nodes) Correlation (Pearson's r): 0.9913 (for an atlas of 1054 nodes)

It seems that the per-streamline metrics are more affected than the final connectomes.

@Lestropie What are your thoughts regarding this? Downsampling does give significant speedup, but if the impact on quality is too much, we may want to just wait a week or two longer to get the accurate metrics instead.
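For reproducibility, the comparison stats can be computed like this (illustrated here on just the eight example weights quoted above; the actual comparison was over all ~2M streamlines):

```python
import numpy as np

# The eight example per-streamline SIFT2 weights quoted above
w_full = np.array([1.289, 0.4778, 0.843, 1.477, 0.8706, 0.4392, 0.2947, 0.439])
w_down = np.array([2.092, 0.7524, 0.974, 1.15, 1.083, 0.6094, 0.366, 0.8125])

pearson_r = np.corrcoef(w_full, w_down)[0, 1]   # linear agreement
mae = np.abs(w_full - w_down).mean()            # mean absolute error
print(f"r = {pearson_r:.3f}, MAE = {mae:.3f}")
```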

Lestropie commented 2 years ago

Those numbers make intuitive sense to me. I'd expect pretty strong correlations for coarse parcellations since a decrease in weight of one streamline could easily be reflected in an increase in weight of an adjacent streamline that is assigned to the same edge. But as the parcellation becomes more fine, this no longer holds as much.

It is a reasonable speedup, but it's not leaps and bounds in the context of the whole pipeline. E.g. 15 minutes to 5 minutes sounds huge, but if in the context of the whole pipeline it's 3 hours to 2 hours and 50 minutes, it's not that big of a deal. Also, the time we ourselves spend waiting for processed data to come in is not really the right metric to think of; that's tiny relative to the future utilisation of those data. The relevant computational costs, or the cost of allocating those 10 saved minutes to instead e.g. generating more streamlines, are more relevant. I'm not hard against using the downsampling approach, but it is a deviation from typical processing that would be preferable to avoid.

There is however another factor to consider here before making a final decision: RAM. Say the nodes you're executing on are 32 threads and 128GB RAM. If each job requires 8GB RAM, then the HPC can't actually execute 32 jobs in parallel on one node, it can only execute 16. tcksift2 is likely the biggest memory hog in the pipeline, and downsampling will reduce that usage. So if the use of downsampling would reduce peak RAM usage of the pipeline to the point where it makes a pragmatic difference in parallelisation on the destination processing system, that's another factor in favour of its use. But you will need to know the specifications of the target system, and measure peak RAM usage with and without FOD downsampling.
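The RAM-versus-cores trade-off above reduces to simple arithmetic; using the hypothetical node specifications quoted:

```python
cores, node_ram_gb, job_ram_gb = 32, 128, 8   # hypothetical node specs
max_parallel_jobs = min(cores, node_ram_gb // job_ram_gb)
print(max_parallel_jobs)  # 16: RAM, not core count, is the binding limit
```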

Lestropie commented 2 years ago

Also just confirm that the processing time of 5 minutes quoted is the sum of the time required to perform both the downsampling and SIFT2, since the former would be a novel computational step introduced for this purpose.

sina-mansour commented 2 years ago

> Also just confirm that the processing time of 5 minutes quoted is the sum of the time required to perform both the downsampling and SIFT2, since the former would be a novel computational step introduced for this purpose.

Yes, the time reported compares direct tcksift2 with downsampling + tcksift2.

I'll keep you updated with regards to both memory requirement and execution time on Spartan.

sina-mansour commented 2 years ago

Given that the gain in time is relatively modest compared to the overall pipeline duration, I think it would be better not to include the downsampling in the finalized pipeline.

This way we can also ensure that the values computed are valid and comparable to previous studies.