How much system memory did this job request? Maybe try increasing the system memory request.
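For example, something along these lines in the batch script controls the host-memory request (a minimal sketch; the values are placeholders, adjust for your node):

```bash
#!/bin/bash
#SBATCH --gres=gpu:1        # one V100
#SBATCH --mem=100G          # host (system) memory for the job; raise this if allocation still fails
#SBATCH --time=24:00:00     # placeholder walltime

# ... run_singularity.py / AlphaFold command goes here ...
```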
If I understand correctly, TensorFlow unified memory should "overflow" to system memory. See:
You might also try logging in to the node running your job while it is running and using nvidia-smi to observe GPU device memory usage. Your system may have other monitoring tools that display that info, too, like Ganglia or similar.
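If your site allows it, something like this on the compute node (a sketch; I'm assuming you can SSH or `srun --overlap` into the node) will show device memory while the job runs:

```bash
# Poll overall device memory every 5 seconds (Ctrl-C to stop)
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 5

# List the processes currently holding GPU memory
nvidia-smi --query-compute-apps=pid,process_name,used_memory --format=csv
```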
The systems on the cluster I manage have V100s with 32 GB, so I have not run into this issue.
I requested 100 GB of system memory in the Slurm job.
The error states it couldn't allocate 6.12 GiB. Our V100 GPUs have 16 GB, so I'm not sure why it's having issues with 6 GiB.
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Failed to allocate request for 6.12GiB (6567780864B) on device ordinal 0
Maybe there are other GPU kernels using GPU memory as well.
Can you provide the job script, including the full command line used to run AlphaFold, and links to any data needed to run the job? I will try to reproduce it.
I'm afraid I won't have much time in the next few weeks to work on this. I'll work on it as time allows.
Thank you so much for your time!
Sorry, I think this is a question for the AlphaFold devs, since it is not an issue with the Singularity container, but with the execution of AlphaFold itself.
Please try the newly released version 2.3.1. A prebuilt image is available: https://cloud.sylabs.io/library/prehensilecode/alphafold_singularity/alphafold
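For example (assuming you want the 2.3.1 tag; check the library page for the exact tags available):

```bash
# Pull the prebuilt image from the Sylabs library
singularity pull alphafold.sif library://prehensilecode/alphafold_singularity/alphafold:2.3.1
```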
Hi David,
I was wondering if I could get some advice on how to prevent an out-of-memory condition on our Slurm cluster. The node I'm submitting to contains a single NVIDIA V100 GPU with 16 GB, and I can't seem to figure out the right parameters to get a test run to finish successfully.
I've been playing around in the run_singularity.py script with a mixture of the following variables:
TF_FORCE_UNIFIED_MEMORY=1 XLA_PYTHON_CLIENT_MEM_FRACTION=.80 XLA_PYTHON_CLIENT_ALLOCATOR=platform XLA_PYTHON_CLIENT_PREALLOCATE=false
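i.e., roughly like this (a simplified sketch of what I've been trying around the run_singularity.py call; the exact combination varies between attempts):

```bash
# Let JAX/TF spill over into host memory instead of pre-allocating most of the 16 GB device
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=.80
export XLA_PYTHON_CLIENT_ALLOCATOR=platform
export XLA_PYTHON_CLIENT_PREALLOCATE=false
```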
Here is the error that I'm seeing in the Slurm output file, but I've also seen: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 6567780864 bytes.
Thanks in advance!!