marcobarilari opened 9 months ago
I just tried it on my Mac too, and I can confirm that a RIM file with an unzipped size of 3.5 GB (875 MB gzipped) makes `LN2_LAYERS` use at least 123 GB of RAM.
In the code of `LN2_LAYERS` there are 36 allocations of the 3D matrix (instances of `nifti_image*`), so I think the numbers check out. To solve this, I think there are three options:
1.) Maybe we can free up the space of interim arrays when they are no longer needed, before more arrays are allocated? E.g. maybe with something like `delete [] nifti_image` (see the sketch after this list).
2.) Maybe we can reuse some of the interim arrays, which might make the code harder to read :-/
3.) just buy bigger computers ;-)
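To make option 1.) concrete, here is an untested sketch of what I mean (the `nii_interim` name and the copy step are just placeholders, not actual `LN2_LAYERS` code; it only assumes niftilib's `nifti_copy_nim_info` and `nifti_image_free`):

```cpp
#include <cstdlib>
#include "nifti1_io.h"  // niftilib header; the include path may differ in the LAYNII tree

// Hypothetical illustration of option 1: release an interim 3D array as soon
// as it is no longer needed, before the next large allocation happens.
void process(nifti_image* nii_rim) {
    // Interim copy used for some intermediate computation (placeholder).
    nifti_image* nii_interim = nifti_copy_nim_info(nii_rim);
    nii_interim->data = calloc(nii_interim->nvox, nii_interim->nbyper);

    // ... compute with nii_interim ...

    // Free both the header struct and its data buffer before allocating the
    // next interim array, instead of keeping all of them alive at once.
    nifti_image_free(nii_interim);
    nii_interim = NULL;
}
```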
Any suggestions, @ofgulban?
Hello @marcobarilari ,
Thanks for opening this issue. I have been looking for something like this to come up to justify further memory optimization in `LN2_LAYERS`. @layerfMRI I already have something in mind to deflate the memory usage, exploiting the sparsity of the gray matter voxels in the whole brain.
With regard to swapping pointers, I think that sounds correct (see the sketch at the end of this comment). @marcobarilari can you send me your rim file (if you wish, via email), so that I can check the new optimizations directly on your case? I assume that you do not depend on this optimization and already have a working solution for yourself, right? It might take a couple of weeks for me to open up some time for this.
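To illustrate what swapping pointers could look like (a hypothetical, generic sketch, not existing LAYNII code): instead of allocating a fresh full-size buffer for every intermediate result, two buffers can be reused by swapping which one is "current".

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical sketch of reusing two buffers by swapping pointers instead of
// allocating a new full-size array for every intermediate step.
int main() {
    const std::size_t nvox = 64 * 64 * 64;         // toy grid size
    std::vector<float> buf_a(nvox, 0.0f), buf_b(nvox, 0.0f);
    std::vector<float>* current = &buf_a;          // result of the previous step
    std::vector<float>* next = &buf_b;             // scratch space for the next step

    for (int step = 0; step < 5; ++step) {
        for (std::size_t i = 0; i < nvox; ++i) {
            (*next)[i] = (*current)[i] + 1.0f;     // stand-in for the real computation
        }
        std::swap(current, next);                  // reuse buffers, no new allocation
    }
    return 0;
}
```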
Hi,
Thank you very much for your answer. I am uploading the file to the cloud and will send you the link.
LMK if you need more info etc.
Marco
I have added a `LN2_RIM_BORDERIZE` program to speed up the computation time of `LN2_LAYERS` (with 185f5c6dfa7fcb456a3af8e3c2194bca52b8dd43). This is possible because there is already an optimization step in place that goes faster if the rim file consists of "hollowed out" non-gray-matter voxel labels (attached below). However, note that this does not result in RAM optimization for now.
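For readers unfamiliar with the idea, a toy sketch of such a "hollowing out" step could look like the following. This is only an illustration, not the `LN2_RIM_BORDERIZE` source, and it assumes the usual LAYNII rim labels (1 = CSF side, 2 = WM side, 3 = gray matter) and a 6-neighborhood:

```cpp
#include <cstdint>
#include <vector>

// Toy "borderize" illustration: zero out non-gray-matter labels unless they
// touch a gray matter voxel along x, y, or z.
std::vector<int16_t> borderize(const std::vector<int16_t>& rim,
                               int nx, int ny, int nz) {
    std::vector<int16_t> out(rim.size(), 0);
    auto idx = [&](int x, int y, int z) { return x + nx * (y + ny * z); };
    for (int z = 0; z < nz; ++z) {
        for (int y = 0; y < ny; ++y) {
            for (int x = 0; x < nx; ++x) {
                const int16_t v = rim[idx(x, y, z)];
                if (v == 3) { out[idx(x, y, z)] = 3; continue; }  // keep all gray matter
                if (v == 0) { continue; }
                bool touches_gm = false;  // check the 6 face neighbors
                if (x > 0      && rim[idx(x - 1, y, z)] == 3) touches_gm = true;
                if (x < nx - 1 && rim[idx(x + 1, y, z)] == 3) touches_gm = true;
                if (y > 0      && rim[idx(x, y - 1, z)] == 3) touches_gm = true;
                if (y < ny - 1 && rim[idx(x, y + 1, z)] == 3) touches_gm = true;
                if (z > 0      && rim[idx(x, y, z - 1)] == 3) touches_gm = true;
                if (z < nz - 1 && rim[idx(x, y, z + 1)] == 3) touches_gm = true;
                if (touches_gm) { out[idx(x, y, z)] = v; }  // keep only the border shell
            }
        }
    }
    return out;
}
```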
[Update] I have started working on writing a new program, `LN3_LAYERS` (in the `devel` branch). I have implemented a rather sophisticated RAM optimization to deflate the requirements. Currently, equidistant layerification computations are implemented and working. A 100 micron isotropic whole brain dataset (BigBrain) takes around 8 minutes to compute (~25% faster than `LN2_LAYERS` on my laptop) and consumes ~15 GB of RAM as opposed to hundreds. This is a massive improvement and I am happy with it, though I might be able to decrease it a bit more. I will proceed with implementing equivolume computations next.
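As a side note for readers, equidistant layering itself boils down to a simple normalization per gray matter voxel. A minimal sketch (not the `LN3_LAYERS` implementation; the distance values here are made up):

```cpp
#include <cmath>
#include <cstdio>

// Toy equidistant binning: normalize the distance to the inner (WM) border by
// the local cortical thickness, then bin the depth into nr_layers layers.
int main() {
    const int nr_layers = 7;
    const float dist_to_wm = 0.6f;   // example distance in mm (made up)
    const float dist_to_csf = 1.2f;  // example distance in mm (made up)

    const float depth = dist_to_wm / (dist_to_wm + dist_to_csf);      // 0 at WM, 1 at CSF
    int layer = static_cast<int>(std::floor(depth * nr_layers)) + 1;  // layers 1..nr_layers
    if (layer > nr_layers) layer = nr_layers;                         // guard depth == 1

    std::printf("depth = %.3f -> layer %d of %d\n", depth, layer, nr_layers);
    return 0;
}
```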
Wow, that's magic. I tested it (without include borders, without equibins, without equivol), and for me the improvement is even higher.
For a 0.3 mm whole brain MP2RAGE scan it is an almost three-fold improvement in time, because of the reduced use of swap. With less RAM available (I have 64 GB on my Mac M1), I anticipate a speed improvement of more than an order of magnitude. How did you achieve this?
Cool that you already tried :D. Many features / outputs are not implemented yet, so the requirements will probably increase a bit. However, the core improvements are:
- Using the `LN2_RIM_BORDERIZE` algorithm to figure out and only allocate memory for the gray matter voxels and their immediate neighbors (borders). It turns out that often only around 20% of the whole brain voxels are cortical gray matter (in a tight whole brain coverage), so allocating memory only for those deflates the RAM by ~80%. This number is now printed as a "sparsity" measurement.
- The speed optimization already present in `LN2_LAYERS` when a borderized rim file is given (using `LN2_RIM_BORDERIZE` on the input rim).
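As a toy illustration of the first point (not the actual `LN3_LAYERS` code): the dense grid is scanned once to build a compact index, and every later working array is sized by the number of needed voxels instead of the full grid.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Toy sparsity illustration: map dense voxel indices to a compact index, then
// allocate working arrays only for the gray matter / border voxels.
int main() {
    const int nx = 10, ny = 10, nz = 10;
    std::vector<int16_t> rim(static_cast<std::size_t>(nx) * ny * nz, 0);
    rim[123] = 3; rim[124] = 3; rim[125] = 2;   // a few labeled voxels for the example

    std::vector<int32_t> dense_to_sparse(rim.size(), -1);  // -1 = voxel not needed
    int32_t nr_needed = 0;
    for (std::size_t i = 0; i < rim.size(); ++i) {
        if (rim[i] != 0) { dense_to_sparse[i] = nr_needed++; }
    }

    // All later working arrays (distances, layer ids, ...) get nr_needed
    // elements instead of nx * ny * nz, which is where the RAM saving comes from.
    std::vector<float> dist_to_wm(nr_needed, 0.0f);
    std::vector<int16_t> layer_id(nr_needed, 0);
    return 0;
}
```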
Hey there! I missed all the updates on this issue somehow. I apologize for that.
Thank you very much @ofgulban for working on this; it is amazing that you managed to make LAYNII even more efficient.
I tried to test `LN3_LAYERS` myself, but I did not manage to compile the new function. What I did was:
1.) switch to the `devel` branch
2.) run `make all` from within the repo folder

However, this does not compile `LN3_LAYERS` (i.e. I don't see it in the main folder together with the other functions).
Any hint? I am a total noob in C++ and similar languages.
OK, never mind, I figured it out. I guess `make all` did not work because `LN3_LAYERS` is not part of the list of programs to compile. When running `make ./src/LN3LAYERS`, it compiled correctly.
Anyway, it is great! It took only 2 minutes for a whole brain rim at 0.25 mm with no disk hogging, compared to 10 minutes and lots of writing/reading, which would fail with a full HD.
In the output, though, I am noticing small differences (see the screenshots: 1) is `LN3_LAYERS` and 2) is `LN2_LAYERS`). Do you know why, and whether it could be of concern?
The command run was `LN3_LAYERS -rim rim123_space-ANAT.nii.gz -nr_layers 7`.
Thanks again for taking the time for this enhancement :)
Hi @marcobarilari ,
These differences are due to the way `LN3_LAYERS` detects "border neighbors" automatically (this is a part of the necessary RAM optimization). If you can post a screenshot of your rim file, I can more confidently tell you whether this is the source of the differences or not. I would not be concerned about this, but again, this is very much a "work in progress" program right now.
Also, yes, indeed `LN3_LAYERS` is not part of the main compilation right now. You need to compile it with the `make LN3_LAYERS` command.
Hi! Thank you for your reply.
Here below are new screenshots, with the rim as well. LMK if you need more info. The order is `LN2_LAYERS`, `LN3_LAYERS`, and rim.
Hi there,
I want to report that during "layerification" with `LN2_LAYERS` I had a problem with too much RAM being used, with the consequence of the system killing the running process.

To give you more info:
- `rim` derived by `recon-all` (freesurfer) on a 0.75 mm iso MP2RAGE, then upsampled to 0.25 mm iso (weight: 875 MB gzipped)
- Linux crunch machine (aka the "labMonster") with 64 GB of RAM
- LAYNII version: 2.3.0
My workaround was to increase the swap memory to 200 GB, and it is working, with ~40 min of CPU time and ~120 GB of memory space (RAM + swap) occupied. As far as I understand, on Macs and Windows this swapping happens by default, so it is possible that the problem might occur on Linux machines, or on Macs/Windows with an almost full HD.

Do you have any suggestions? Does it make sense?
Marco