CUDA out of memory with abinit_het

Hi!

I'm trying to run cryodrgn abinit_het on 200K particles downsampled to 128, but the process quickly terminates with an error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.71 GiB. GPU

I get similar errors when I tried the following commands:

cryodrgn abinit_het particles_128_recentered/particles.128.txt --ctf ctf.pkl --zdim 8 --ind ind200k.pkl --lazy -o abinitio/ > abinitio/abinito.log
cryodrgn abinit_het particles_128_recentered/particles.128.txt --ctf ctf.pkl --zdim 8 --ind ind200k.pkl -o abinitio/ > abinitio/abinito.log
cryodrgn abinit_het particles_128_recentered/particles.128.txt --ctf ctf.pkl --zdim 8 --ind ind200k.pkl --multigpu -o abinitio/ > abinitio/abinito.log

I get the same errors when running these commands on a workstation with 4 GPUs (NVIDIA GeForce GTX 1080, 8192 MiB memory each) and one with 2 GPUs (NVIDIA GeForce GTX 2080, 11264 MiB each).

I've also tried export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to help PyTorch avoid fragmentation, and tried using torch.cuda.set_per_process_memory_fraction(0.9) in a python shell to limit how much of the available memory that torch could use. However, neither approach worked.

I was able to successfully use train_vae for an even larger number of particles. Does the ab initio job in particular require a significant amount of GPU memory? Is there any way around this?

Cheers, Curtis

Yup, everything else being equal, ab-initio reconstruction uses more memory than reconstruction with fixed poses. Can you also try export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 to see if that helps?

However, with a relatively small amount of memory (~10GB per GPU) you may be stuck running smaller models. The default for abinit_het is 256x3, so can you try something like 128x2? I would also be curious to see the output of the nvidia-smi command on one or both of your workstations, as well as if you have any way of profiling the memory usage of the processes running on them.

Best, Michal

Hi Michal,

Thank you for your suggestions!

I tried using your export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 environment variable, but it doesn't seem to have helped - the job encounters the same out of memory error at the same pretrain iteration of 10000 (see log file below). The same happens when I use the 128x2 models for both the decoder and encoder: cryodrgn abinit_het particles_128_recentered/particles.128.txt --ctf ctf.pkl --zdim 8 --ind ind200k.pkl --enc-layers 2 --enc-dim 128 --dec-layers 2 --dec-dim 128 --multigpu -o abinitio/ > abinitio/abinito.log

Actually, now that I'm looking at the log file with fresh eyes, I noticed that the job runs into the out-of-memory error in what appears to be the final pretrain iteration. Is this the most memory intensive part of the job? In the meantime, I'm looking into using my institute's supercomputer cluster to run these ab initio jobs - their GPUs are much better than ours.

To answer your last question, I've been monitoring GPU memory usage by printing the results of nvidia-smi to a log file, and just comparing the timestamps with the timestamps from the ab initio log file. The output of nvidia-smi prints the memory usage of each of the two GPUs on this workstation (GeForce RTX 2080 Ti, 11.264 GB memory). It's not a very sophisticated method, so if you know of a better method, I'm all ears!

Output from the log file for this ab initio command: cryodrgn abinit_het particles_128_recentered/particles.128.txt --ctf ctf.pkl --zdim 8 --ind ind200k.pkl --multigpu -o abinitio/ > abinitio/abinito.log

(INFO) (abinit_het.py) (19-Sep-24 10:40:45) Using random poses for 10000 iterations
(INFO) (abinit_het.py) (19-Sep-24 10:41:15) [Pretrain Iteration 2000] loss=0.591484
(INFO) (abinit_het.py) (19-Sep-24 10:41:43) [Pretrain Iteration 4000] loss=0.578094
(INFO) (abinit_het.py) (19-Sep-24 10:42:10) [Pretrain Iteration 6000] loss=0.592761
(INFO) (abinit_het.py) (19-Sep-24 10:42:37) [Pretrain Iteration 8000] loss=0.590331
(INFO) (abinit_het.py) (19-Sep-24 10:43:03) [Pretrain Iteration 10000] loss=0.587424
tail: abinito.log: file truncated

At this point, I get the following error: torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.35 GiB. GPU

Output from nvidia-smi --query-gpu=timestamp,memory.used --format=csv,nounits -l 1 > gpu_memory_log.csv

2024/09/19 10:42:50.045, 3251
2024/09/19 10:42:50.045, 18
2024/09/19 10:42:51.045, 3251
2024/09/19 10:42:51.045, 18
2024/09/19 10:42:52.045, 3251
2024/09/19 10:42:52.045, 18
2024/09/19 10:42:53.045, 3219
2024/09/19 10:42:53.046, 18
2024/09/19 10:42:54.046, 3251
2024/09/19 10:42:54.046, 18
2024/09/19 10:42:55.046, 3251
2024/09/19 10:42:55.046, 18
2024/09/19 10:42:56.046, 3219
2024/09/19 10:42:56.046, 18
2024/09/19 10:42:57.046, 3219
2024/09/19 10:42:57.046, 18
2024/09/19 10:42:58.047, 3219
2024/09/19 10:42:58.047, 18
2024/09/19 10:42:59.047, 3219
2024/09/19 10:42:59.047, 18
2024/09/19 10:43:00.047, 3219
2024/09/19 10:43:00.047, 18
2024/09/19 10:43:01.047, 3219
2024/09/19 10:43:01.047, 18
2024/09/19 10:43:02.047, 3219
2024/09/19 10:43:02.048, 18
2024/09/19 10:43:03.048, 3219
2024/09/19 10:43:03.048, 18
2024/09/19 10:43:04.048, 3219
2024/09/19 10:43:04.048, 18
2024/09/19 10:43:05.048, 9005
2024/09/19 10:43:05.048, 9306
2024/09/19 10:43:06.048, 9235
2024/09/19 10:43:06.049, 10418
2024/09/19 10:43:07.049, 827
2024/09/19 10:43:07.049, 15
2024/09/19 10:43:08.049, 827
2024/09/19 10:43:08.049, 15

It appears that the final pretrain iteration tries to request an additional 2.35 GiB beyond the 11.264 GB memory that the GPU actually has, and it requests the same 2.35 GiB of additional memory regardless of whether I use your PYTORCH_CUDA_ALLOC_CONF environment variable or smaller model architecture.

I forgot to ask this in my first post, but is it expected that despite using the --multigpu flag, cryoDRGN seems to only be using the memory of one of the two GPUs?

Cheers, cbeck

While I look into the other details mentioned above, can you also try using the minimal batch size with --multigpu, that is, appending -b 1 to your commands, as this may also help resolve memory issues on a smaller workstation? I'd also like to see the output of just vanilla nvidia-smi on your workstation, that is, without query and output flags!

-Mike

I appended -b 1 to the command, and it's running now! It finished the final pretrain iteration and is currently training epoch 1. I'll update you if the job finishes successfully. However, I'm getting the following warning incessantly:

/programs/x86_64-linux/cryodrgn/3.3.3/miniconda/lib/python3.9/site-packages/torch/nn/functional.py:4343: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn(

Additionally, could I ask you to explain what the batch size means and its relationship to GPU memory?

And here's the output for nvidia-smi, sorry for the misunderstanding!

Fri Sep 20 10:25:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:09:00.0  On |                  N/A |
|  0%   48C    P2              79W / 250W |   1099MiB / 11264MiB |     25%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:0A:00.0 Off |                  N/A |
|  0%   38C    P8               2W / 250W |     18MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2110      G   /usr/lib/xorg/Xorg                          102MiB |
|    0   N/A  N/A      2627      G   /usr/lib/xorg/Xorg                          105MiB |
|    0   N/A  N/A      6983      G   /usr/lib/xorg/Xorg                          175MiB |
|    0   N/A  N/A      7115      G   /usr/bin/gnome-shell                        128MiB |
|    0   N/A  N/A      8023      C   ...cryodrgn/3.3.3/miniconda/bin/python      552MiB |
|    1   N/A  N/A      2110      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2627      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      6983      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

On a different note, I saw that you're one of the authors on the new DRGN-AI biorxiv preprint that was released a couple months ago - would you recommend DRGN-AI over cryoDRGN's ab initio? Does DRGN-AI have the same GPU memory requirements?

Thank you! cbeck

Hello @cbeck22! If you're running into memory problems, I would make sure you're using a smaller decoder (e.g. 256x3) and decrease the batch size to -b 4 or -b 1.

DRGN-AI is our latest version of ab initio reconstruction. It should be much better (and faster), so l would give it a shot. (FYI we benchmarked both in cryobench). Let us know if you run into any problems. We're working on incorporating DRGN-AI into the next major version of cryoDRGN, but for now it's a standalone piece of software.

ml-struct-bio / cryodrgn

CUDA out of memory with abinit_het #403