nipreps / mriqc

Automated Quality Control and visual reports for Quality Assessment of structural (T1w, T2w) and functional MRI of the brain
http://mriqc.readthedocs.io
Apache License 2.0

Synthstrip runs out of memory #1004

Closed: Jane12345-all closed this issue 1 year ago

Jane12345-all commented 2 years ago

Dear Dr. Oscar Esteban,

We are getting an error when running the command "docker run -it --rm -v /mnt/g/MRIQC/test/BIDS:/data:ro -v /mnt/g/MRIQC/test/results:/out nipreps/mriqc:latest /data /out participant --participant_label 068 --no-sub".

Any advice on where this error (below) is coming from and possible workarounds?

crash-20220704-085239-root-synthstrip.a0-614528bb-8bee-4207-bc2d-e6e7d7fe3bcd.txt

celprov commented 2 years ago

Hi Jane, I had the same problem. Looking into the crash log, return code 137 is emitted, meaning that the container ran out of memory. The issue is caused by SynthStrip being memory-hungry while 2 GB is the default memory allocated to a Docker container. Allocating more memory (4 GB) to the container using the --memory flag solved the problem in my case. Try: docker run -it --memory="4g" --rm -v /mnt/g/MRIQC/test/BIDS:/data:ro -v /mnt/g/MRIQC/test/results:/out nipreps/mriqc:latest /data /out participant --participant_label 068 --no-sub
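As an aside, if you want to double-check that the --memory flag was actually applied, a rough sanity check like the one below can be run inside the container (this is not MRIQC code, and the cgroup paths are assumptions that depend on whether the host uses cgroup v1 or v2):

    # check_mem_limit.py -- print the memory limit the container actually sees
    from pathlib import Path

    CANDIDATES = [
        Path("/sys/fs/cgroup/memory.max"),                    # cgroup v2 (assumed path)
        Path("/sys/fs/cgroup/memory/memory.limit_in_bytes"),  # cgroup v1 (assumed path)
    ]

    for path in CANDIDATES:
        if path.exists():
            raw = path.read_text().strip()
            if raw == "max":
                print(f"{path}: no memory limit set")
            else:
                print(f"{path}: limit is {int(raw) / 1024**3:.1f} GiB")
            break
    else:
        print("No known cgroup memory file found; cannot determine the limit")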

Also, if you're planning to run MRIQC on more than one participant, it would be a good idea to keep launching only one participant at a time, using the --participant_label flag as you're doing now.

I hope it helps, Best, Celine

celprov commented 2 years ago

I am facing the problem that SynthStrip sometimes still fails and returns code 137, even after allocating 24 GB to the Docker container and running only one subject at a time. What puzzles me most is that increasing the allocated memory does not seem to reliably avoid the failure of SynthStrip: some subjects were able to run with only 4 GB, but some still fail with 24 GB. It feels like SynthStrip gets hungry for all the memory we allocate to it.

Here is the code I use to run MRIQC:

    sub_list=(sub-009673 sub-010769 sub-012059 sub-012322 sub-016504 sub-067018 sub-089207 sub-105822 sub-107738)
    bs=1  # batch size (number of participants per MRIQC call)
    for ((i=0; i<${#sub_list[@]}; i+=bs)); do
        # launch MRIQC on batches of subjects
        batch=("${sub_list[@]:$i:$bs}")
        echo "${batch[@]}"
        docker run -u $(id -u) -it --memory="24g" --rm \
            -v $HOME/datasets/MR-ART/:/data:ro \
            -v $HOME/derivatives/mriqc/v22.0.6/MR-ART/:/out \
            -v $HOME/tmp/mriqc/v22.0.6/MR-ART/:/work \
            nipreps/mriqc:22.0.6 /data /out --ica --verbose-report participant \
            --participant-label "${batch[@]}" -w /work -vv
    done
oesteban commented 1 year ago

Hi @ahoopes, do you have any reports of Pytorch overeagerly allocating memory on your end?

Should we update the encapsulated version with some new models you have developed? Or else, any tips on how to limit the memory SynthStrip can allocate (with CPUs)?

oesteban commented 1 year ago

Okay, I've dug a bit more into this, and I have two non-exclusive explanations:

  1. The OMP_NTHREADS bug/problem with numpy is hitting, and PyTorch is using all available CPUs. This could easily be checked with PyTorch's API (torch.set_num_threads); see the sketch after this list.
  2. Relatedly, Python is not releasing memory properly, and it seems that replacing Linux's default malloc helps: https://discuss.pytorch.org/t/memory-leaks-at-inference/85108/14 (and, with celery: https://zapier.com/engineering/celery-python-jemalloc/)
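For reference, a minimal sketch of how the thread-count hypothesis could be checked and capped, assuming it runs before SynthStrip loads its model (the cap of 1 thread is only illustrative):

    import os

    # Cap OpenMP threads before torch is imported, in case OMP_NUM_THREADS is
    # what numpy/PyTorch pick up at startup (hypothesis 1 above).
    os.environ.setdefault("OMP_NUM_THREADS", "1")

    import torch

    # Explicitly limit intra-op and inter-op parallelism through PyTorch's API.
    torch.set_num_threads(1)
    torch.set_num_interop_threads(1)

    print("intra-op threads:", torch.get_num_threads())
    print("inter-op threads:", torch.get_num_interop_threads())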

cc/ @esavary, @celprov, @effigies, @mgxd.

araikes commented 1 year ago

Tagging in, as I'm running into the same issue and want to keep track of progress on this.

oesteban commented 1 year ago

Reusing work from @esavary, I have profiled MRIQC in several settings. All the reported runs are on one dataset with one subject and 4 sessions, each with a T1w image.

Command line run:

mriqc /data/datasets/hcph-pilot/ /data/derivatives/hcph-pilot/mriqc participant -m T1w -vv -w ./work --omp-nthreads 12 --nprocs 36

cc/ @effigies @esavary @mgxd

jadenecke commented 1 year ago

Also tagging in, since for me even with nprocs=48, omp-nthreads=16, and 200 GB of memory, some runs exit with error 137. Also, in case you have not seen it, this issue may be relevant: https://github.com/freesurfer/freesurfer/issues/1032

celprov commented 1 year ago

@esavary could you please summarize what you have found here?

esavary commented 1 year ago

Hi,

I think the problem happens during inference with the model. During inference, only the feature map of the previous layer is necessary for the computation, so I expect that PyTorch only stores one feature map at a time. However, the feature maps generated by the first and last convolutional layers of the model can be very large.

For example, with same padding and a stride of 1, the first layer will generate a feature map of 256 × 256 × 192 × 16 values for a (256 × 256 × 192) image; at float32 precision that is 256 × 256 × 192 × 16 × 32 bits ≈ 6.4 Gbit, i.e. roughly 0.8 GB per feature map.
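Spelling out the arithmetic (the 16-channel first layer is an assumption taken from the estimate above):

    # Rough size of a single feature map from the first convolutional layer,
    # assuming a 256 x 256 x 192 input and 16 output channels at float32.
    voxels = 256 * 256 * 192        # spatial size of the input volume
    channels = 16                   # assumed number of output channels
    bytes_per_value = 4             # float32

    size_bytes = voxels * channels * bytes_per_value
    print(f"{size_bytes / 1e9:.2f} GB per feature map")   # ~0.81 GB
    print(f"{size_bytes * 8 / 1e9:.2f} Gbit")             # ~6.44 Gbit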

If we run multiple instances of SynthStrip simultaneously, we can easily exceed the memory limits of Docker containers. Maybe an option could be to allow users to run only one participant at a time?

oesteban commented 1 year ago

As we discussed two IT monitoring meetings ago, I have been looking into running "massive" MRIQC processes with all subjects together and letting nipype handle the memory. It seems that the memory footprint of the parent process doesn't go crazy with ~200 subjects (1 GB for the parent process, plus around 1 GB per worker in total, which is close to single-subject runs). That's great news.

The only problem I have encountered is the way nipype resolves the node dependency matrix. One "get_metadata" node per subject blocks the main thread (and the same happens, very similarly, for small processes run in workers), making the computation really slow. I'm wondering whether this would also be a problem for pydra (@satra, @djarecka).

Until the pydra migration is finished, it may be worth exploring a "smarter" MultiProc that can identify disjoint compute subgraphs (in the case of MRIQC, there are no connections between subjects) and establish a thread pool over a number of them (obviously you don't want to saturate the number of threads). These threads could then submit jobs to the main process pool and its workers, creating an extra layer of perceived embarrassingly parallel execution.
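Roughly along these lines, as a minimal sketch only: run_subgraph is a hypothetical stand-in for whatever actually executes a set of nodes, and the graph is a plain networkx DiGraph rather than a real nipype workflow:

    from concurrent.futures import ThreadPoolExecutor

    import networkx as nx

    def run_subgraph(subgraph: nx.DiGraph) -> None:
        # Hypothetical executor: a real implementation would submit the
        # subgraph's nodes to the shared process pool in dependency order.
        for node in nx.topological_sort(subgraph):
            print(f"running {node}")

    def run_disjoint_subgraphs(graph: nx.DiGraph, max_threads: int = 4) -> None:
        """Split the workflow graph into disconnected components (one per
        subject in the MRIQC case) and drive each from its own thread."""
        components = [
            graph.subgraph(nodes).copy()
            for nodes in nx.weakly_connected_components(graph)
        ]
        # Cap the thread pool so the schedulers themselves don't saturate the CPU.
        with ThreadPoolExecutor(max_workers=max_threads) as pool:
            list(pool.map(run_subgraph, components))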

Opinions? @effigies @mgxd

effigies commented 1 year ago

Where is the get_metadata? I don't see it in the workflow, so I'm not following the issue here.

oesteban commented 1 year ago

Apologies, I'm referring to https://github.com/nipreps/mriqc/blob/master/mriqc/workflows/anatomical.py#L384

effigies commented 1 year ago

I see. If that's a bottleneck, I would probably just reimplement it without pybids rather than rearchitect nipype.

oesteban commented 1 year ago

This node takes ~50s (which is PyBIDS' responsibility). However, the example is good - many of these theoretically lightweight interfaces could be handled better by nipype.
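For context, a minimal sketch of what a pybids-free metadata read could look like, assuming the sidecar JSON sits next to the NIfTI file and ignoring the BIDS inheritance principle that pybids handles:

    import json
    from pathlib import Path

    def read_sidecar_metadata(nifti_path: str) -> dict:
        """Read the BIDS JSON sidecar sitting next to a NIfTI file
        (deliberately ignoring metadata inherited from upper levels)."""
        path = Path(nifti_path)
        # strip ".nii.gz" or ".nii" and append ".json"
        stem = path.name.removesuffix(".gz").removesuffix(".nii")
        sidecar = path.parent / f"{stem}.json"
        if not sidecar.exists():
            return {}
        return json.loads(sidecar.read_text())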

neurorepro commented 1 year ago

Hello, it seems I am suffering from this error as well. I am using the latest Docker image. Its release date (Mar 28) appears to be one day before the merge of the commit fixing this issue (Mar 29), as we investigated with @celprov.

However, it is strange, as the amount of RAM used (according to htop) does not seem to reach half of the total before my Ubuntu window manager crashes (causing a reset to the login screen).

We have two Ubuntu workstations experiencing the same issue with the latest Docker image (crash to the login screen). I am not sure how to confirm that this is indeed due to this specific issue.