uw-ipd / RoseTTAFold2


CUDA error out of memory #29

Open lloydtripp opened 6 months ago

lloydtripp commented 6 months ago

Hello,

I've been folding protein variants with single amino-acid substitutions that are part of a heterotetramer complex. See the attached FASTA for an example fold query (SGCB_A9G.fa.txt).

This is the command to request the fold: /path/to/RosettaFold2/run_RF2.sh $fasta_file_location -o $output_directory --pair

About 3/4 of the models generate successfully, but the remaining 1/4 error out with a memory issue. See the error logs for the traceback details: RF2_Job336369_99.out.txt, RF2_Job336369_99.err.txt

The computing environment is IBM's LSF. The requested nodes have 64 GB of RAM and a single TeslaV100_SXM2_32GB GPU. The RAM usage doesn't seem to go beyond 22 GB, but I don't have insight into the GPU utilization.
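If it helps, one way I could get some visibility into GPU memory would be a sketch like the one below; it assumes nvidia-smi is available on the compute node, and the log file name is a placeholder.

```bash
# Sketch: log GPU memory every 5 seconds while the fold job runs (log path is a placeholder).
nvidia-smi --query-gpu=timestamp,memory.used,memory.total --format=csv -l 5 > gpu_mem.log &
MONITOR_PID=$!
/path/to/RosettaFold2/run_RF2.sh "$fasta_file_location" -o "$output_directory" --pair
kill "$MONITOR_PID"
```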

Is there anything I can do on my end? Can the code be fixed to deal with this issue? My temporary solution is to re-run failed jobs, but this is not ideal.
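Roughly, that re-run workaround looks like the sketch below; the queries/ layout and the model_final.pdb check are hypothetical stand-ins for my actual file names.

```bash
# Sketch: re-submit only the jobs whose expected output is missing.
# Directory layout and the model_final.pdb name are hypothetical stand-ins.
for fasta in queries/*.fa; do
    out_dir="results/$(basename "$fasta" .fa)"
    if [ ! -f "$out_dir/model_final.pdb" ]; then
        /path/to/RosettaFold2/run_RF2.sh "$fasta" -o "$out_dir" --pair
    fi
done
```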

Best, Lloyd Tripp

lloydtripp commented 6 months ago

I have a similar but different error now.

RF2-preMSA_Job557425_202.err.txt

robert-bolz commented 3 months ago

Just wanted to add that I am also currently getting this issue. I am trying to run PDB:1aqf (a tetramer complex).

I am running on the Ohio Supercomputer Center's Ascend cluster, which uses NVIDIA A100 80 GB GPUs. Here is the error I am getting:

torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 32.88 GiB (GPU 0; 79.15 GiB total capacity; 46.56 GiB already allocated; 32.00 GiB free; 46.59 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF

Has anyone figured out a solution for this? I tried setting max_split_size_mb but had no luck.
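For anyone trying the same thing, the allocator option is read from an environment variable set before the run; the 128 MiB value below is only an example starting point, not a recommendation from this repo, and the input/output names are placeholders.

```bash
# Example only: cap the caching allocator's split block size (the value is arbitrary).
export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128
/path/to/RosettaFold2/run_RF2.sh input.fa -o output_dir --pair  # input.fa / output_dir are placeholders
```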

Khas-Erdene-1 commented 2 months ago

> Just wanted to add that I am also currently getting this issue. I am trying to run PDB:1aqf (a tetramer complex). […]
>
> Has anyone figured out a solution for this? I tried setting max_split_size_mb but had no luck.

I ran the following command to update torch, and after that it ran normally: pip install torch torchvision torchaudio
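For anyone else trying this, an explicit upgrade plus a quick version check would look something like the sketch below; make sure the PyTorch build you install matches the CUDA version on your cluster.

```bash
# Sketch: force-upgrade the PyTorch stack, then confirm the torch and CUDA versions it reports.
pip install --upgrade torch torchvision torchaudio
python -c "import torch; print(torch.__version__, torch.version.cuda)"
```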