ml-struct-bio / cryodrgn

Neural networks for cryo-EM reconstruction
http://cryodrgn.cs.princeton.edu

CUDA out of memory with abinit_het #403

cbeck22 opened this issue 2 months ago

cbeck22 commented 2 months ago

Hi!

I'm trying to run cryodrgn abinit_het on 200K particles downsampled to 128, but the process quickly terminates with an error:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 1.71 GiB. GPU

I got similar errors when I tried the following commands:

I get the same errors when running these commands on a workstation with 4 GPUs (NVIDIA GeForce GTX 1080, 8192 MiB each) and on one with 2 GPUs (NVIDIA GeForce RTX 2080 Ti, 11264 MiB each).

I've also tried export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True to help PyTorch avoid fragmentation, and torch.cuda.set_per_process_memory_fraction(0.9) in a Python shell to limit how much of the available memory torch can use. However, neither approach worked.
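
For reference, here's roughly what I tried (a minimal sketch; the env var I actually set with export in the shell, and I'm not sure the memory-fraction cap even carries over to a cryodrgn job launched separately):

import os

# Python equivalent of `export PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True`;
# it has to be set before CUDA is initialized to take effect
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = "expandable_segments:True"

import torch

# cap PyTorch at 90% of each GPU's memory; note this only applies to the
# current Python process, not to a cryodrgn job launched from the shell
for d in range(torch.cuda.device_count()):
    torch.cuda.set_per_process_memory_fraction(0.9, device=d)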

I was able to run train_vae successfully on an even larger number of particles. Does the ab initio job in particular require a significant amount of GPU memory? Is there any way around this?

Cheers, Curtis

michal-g commented 2 months ago

Yup, everything else being equal, ab-initio reconstruction uses more memory than reconstruction with fixed poses. Can you also try export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 to see if that helps?

However, with a relatively small amount of memory (~10 GB per GPU) you may be stuck running smaller models. The default architecture for abinit_het is 256x3, so can you try something like 128x2? I'd also be curious to see the output of the nvidia-smi command on one or both of your workstations, and whether you have any way of profiling the memory usage of the processes running on them.
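
If it helps, one rough way to profile is to query PyTorch's own allocator stats from inside a Python session (just a sketch; these numbers only cover the process they're called from, so it would have to run inside the job itself):

import torch

# snapshot of PyTorch's CUDA allocator usage on every visible GPU
gib = 1024 ** 3
for d in range(torch.cuda.device_count()):
    print(
        f"GPU {d}: "
        f"allocated={torch.cuda.memory_allocated(d) / gib:.2f} GiB, "
        f"reserved={torch.cuda.memory_reserved(d) / gib:.2f} GiB, "
        f"peak={torch.cuda.max_memory_allocated(d) / gib:.2f} GiB"
    )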

Best, Michal

cbeck22 commented 2 months ago

Hi Michal,

Thank you for your suggestions!

I tried your export PYTORCH_CUDA_ALLOC_CONF=garbage_collection_threshold:0.6,max_split_size_mb:128 environment variable, but it doesn't seem to have helped: the job hits the same out-of-memory error at the same pretrain iteration of 10000 (see log file below). The same happens when I use a 128x2 architecture for both the encoder and decoder:

cryodrgn abinit_het particles_128_recentered/particles.128.txt --ctf ctf.pkl --zdim 8 --ind ind200k.pkl --enc-layers 2 --enc-dim 128 --dec-layers 2 --dec-dim 128 --multigpu -o abinitio/ > abinitio/abinito.log

Actually, now that I'm looking at the log file with fresh eyes, I noticed that the job runs into the out-of-memory error in what appears to be the final pretrain iteration. Is this the most memory-intensive part of the job? In the meantime, I'm looking into using my institute's supercomputer cluster to run these ab initio jobs, since their GPUs are much better than ours.

To answer your last question, I've been monitoring GPU memory usage by printing the output of nvidia-smi to a log file and comparing its timestamps with those in the ab initio log file. The output shows the memory usage of each of the two GPUs on this workstation (GeForce RTX 2080 Ti, 11264 MiB each). It's not a very sophisticated method, so if you know of a better approach, I'm all ears!
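
In case it's useful, the polling loop I've been using looks roughly like this (a sketch; the 30-second interval and the gpu_mem.log filename are just what I happened to pick):

import datetime
import subprocess
import time

# append a timestamped nvidia-smi memory snapshot to a log file every 30 s,
# so the entries can be lined up against the timestamps in the cryodrgn log
with open("gpu_mem.log", "a") as log:
    while True:
        log.write(f"=== {datetime.datetime.now().isoformat()} ===\n")
        result = subprocess.run(
            ["nvidia-smi", "--query-gpu=index,memory.used,memory.total", "--format=csv"],
            capture_output=True,
            text=True,
        )
        log.write(result.stdout)
        log.flush()
        time.sleep(30)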

I forgot to ask this in my first post, but is it expected that, despite the --multigpu flag, cryoDRGN seems to use the memory of only one of the two GPUs?

Cheers, cbeck

michal-g commented 2 months ago

While I look into the other details mentioned above, can you also try using the minimal batch size with --multigpu, i.e. appending -b 1 to your commands? That may also help resolve memory issues on a smaller workstation. I'd also like to see the output of just vanilla nvidia-smi on your workstation, that is, without the query and output flags!

-Mike

cbeck22 commented 2 months ago

I appended -b 1 to the command, and it's running now! It finished the final pretrain iteration and is currently training epoch 1. I'll update you if the job finishes successfully. However, I'm getting the following warning incessantly:

/programs/x86_64-linux/cryodrgn/3.3.3/miniconda/lib/python3.9/site-packages/torch/nn/functional.py:4343: UserWarning: Default grid_sample and affine_grid behavior has changed to align_corners=False since 1.3.0. Please specify align_corners=True if the old behavior is desired. See the documentation of grid_sample for details.
  warnings.warn(

Additionally, could you explain what the batch size means and how it relates to GPU memory?

And here's the output for nvidia-smi, sorry for the misunderstanding!

Fri Sep 20 10:25:39 2024       
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.183.01             Driver Version: 535.183.01   CUDA Version: 12.2     |
|-----------------------------------------+----------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|                                         |                      |               MIG M. |
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:09:00.0  On |                  N/A |
|  0%   48C    P2              79W / 250W |   1099MiB / 11264MiB |     25%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce RTX 2080 Ti     Off | 00000000:0A:00.0 Off |                  N/A |
|  0%   38C    P8               2W / 250W |     18MiB / 11264MiB |      0%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes:                                                                            |
|  GPU   GI   CI        PID   Type   Process name                            GPU Memory |
|        ID   ID                                                             Usage      |
|=======================================================================================|
|    0   N/A  N/A      2110      G   /usr/lib/xorg/Xorg                          102MiB |
|    0   N/A  N/A      2627      G   /usr/lib/xorg/Xorg                          105MiB |
|    0   N/A  N/A      6983      G   /usr/lib/xorg/Xorg                          175MiB |
|    0   N/A  N/A      7115      G   /usr/bin/gnome-shell                        128MiB |
|    0   N/A  N/A      8023      C   ...cryodrgn/3.3.3/miniconda/bin/python      552MiB |
|    1   N/A  N/A      2110      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      2627      G   /usr/lib/xorg/Xorg                            4MiB |
|    1   N/A  N/A      6983      G   /usr/lib/xorg/Xorg                            4MiB |
+---------------------------------------------------------------------------------------+

On a different note, I saw that you're one of the authors of the new DRGN-AI bioRxiv preprint released a couple of months ago. Would you recommend DRGN-AI over cryoDRGN's ab initio? Does DRGN-AI have the same GPU memory requirements?

Thank you! cbeck

zhonge commented 1 month ago

Hello @cbeck22! If you're running into memory problems, I would make sure you're using a smaller decoder (e.g. 256x3) and decrease the batch size to -b 4 or -b 1.

DRGN-AI is our latest version of ab initio reconstruction. It should be much better (and faster), so I would give it a shot. (FYI, we benchmarked both in CryoBench.) Let us know if you run into any problems. We're working on incorporating DRGN-AI into the next major version of cryoDRGN, but for now it's a standalone piece of software.