prehensilecode / alphafold_singularity

Singularity recipe for AlphaFold

jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED #17

Closed: infinity01 closed this issue 1 year ago

infinity01 commented 1 year ago

Hi David,

I was wondering if I could get some advice on how to prevent an out-of-memory condition on our Slurm cluster. The node I'm submitting to contains a single NVIDIA V100 GPU with 16 GB of memory, and I can't seem to figure out the right parameters to get a test run to finish successfully.

I've been playing around in the run_singularity.py script with a mixture of the following variables:

TF_FORCE_UNIFIED_MEMORY=1
XLA_PYTHON_CLIENT_MEM_FRACTION=.80
XLA_PYTHON_CLIENT_ALLOCATOR=platform
XLA_PYTHON_CLIENT_PREALLOCATE=false
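For context, a minimal sketch of how these could be forwarded into the container from run_singularity.py, assuming variables prefixed with `SINGULARITYENV_` are copied into the container by Singularity/Apptainer (the actual script may pass them differently); the values are simply the ones being tested here:

```python
import os

# Sketch only (not the actual run_singularity.py code): Singularity/Apptainer
# forwards host variables prefixed with SINGULARITYENV_ into the container.
gpu_memory_env = {
    'TF_FORCE_UNIFIED_MEMORY': '1',             # allow GPU allocations to spill into host RAM
    'XLA_PYTHON_CLIENT_MEM_FRACTION': '.80',    # limit JAX's preallocation to 80% of GPU memory
    'XLA_PYTHON_CLIENT_ALLOCATOR': 'platform',  # allocate/free on demand instead of pooling
    'XLA_PYTHON_CLIENT_PREALLOCATE': 'false',   # do not reserve most of the GPU up front
}

for key, value in gpu_memory_env.items():
    os.environ[f'SINGULARITYENV_{key}'] = value
```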

Here is the error I'm seeing in the Slurm output file, but I've also seen: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Out of memory while trying to allocate 6567780864 bytes.

Thanks in advance!!

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 422, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 406, in main
    random_seed=random_seed)
  File "/app/alphafold/run_alphafold.py", line 199, in predict_structure
    random_seed=model_random_seed)
  File "/app/alphafold/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Failed to allocate request for 6.12GiB (6567780864B) on device ordinal 0
INFO: AlphaFold returned 0
prehensilecode commented 1 year ago

How much system memory did this job request? Maybe try increasing the system memory request.

If I understand correctly, TensorFlow unified memory should "overflow" to system memory.

You might also try logging in to the node running your job while it is running and using nvidia-smi to observe GPU device memory usage. Your system may have other monitoring tools that display that info too, like Ganglia or similar.
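As a rough illustration (not part of this repo), a polling loop like the following could watch GPU memory from the node while the job runs; it assumes a standard nvidia-smi and the single-GPU node described above:

```python
import subprocess
import time

# Illustrative monitoring loop: query used/total GPU memory every 30 seconds.
# With several GPUs, nvidia-smi prints one CSV line per device; this sketch
# only reads the first line.
while True:
    result = subprocess.run(
        ['nvidia-smi',
         '--query-gpu=memory.used,memory.total',
         '--format=csv,noheader,nounits'],
        capture_output=True, text=True, check=True)
    used_mib, total_mib = result.stdout.strip().splitlines()[0].split(', ')
    print(f'GPU memory used: {used_mib} MiB of {total_mib} MiB')
    time.sleep(30)
```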

The systems on the cluster I manage have 32 GB V100s, so I have not run into this issue.

infinity01 commented 1 year ago

I requested 100GB of system memory in the slurm job.

The error states it couldn't allocate 6.12 GiB. Our V100 GPUs have 16 GB, so I'm not sure why it's having issues with 6 GiB:

jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: RESOURCE_EXHAUSTED: Failed to allocate request for 6.12GiB (6567780864B) on device ordinal 0


prehensilecode commented 1 year ago

Maybe there are other GPU kernels using GPU memory as well.

Can you provide the job script, including the full command line used to run AlphaFold, and links to any data needed to run the job? I will try to reproduce the issue.

I'm afraid I won't have much time in the next few weeks to work on this. I'll work on it as time allows.

infinity01 commented 1 year ago

run_singularity.py.txt

infinity01 commented 1 year ago

Emue_salk.fasta.txt

infinity01 commented 1 year ago

alphafold.slurm.txt

infinity01 commented 1 year ago

Thank you so much for your time!

prehensilecode commented 1 year ago

Sorry, I think this is a question for the AlphaFold devs, since it is not an issue with the Singularity container, but with the execution of AlphaFold itself.

prehensilecode commented 1 year ago

Please try the newly released version 2.3.1. A prebuilt image is available: https://cloud.sylabs.io/library/prehensilecode/alphafold_singularity/alphafold
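As a hedged sketch, pulling that prebuilt image could look roughly like the following; the library URI is inferred from the link above, and the available version tags are an assumption, so check the library page for what is published:

```python
import subprocess

# Sketch: pull the prebuilt image from the Sylabs library into alphafold.sif.
# Omitting a tag pulls the default ':latest'; a specific version tag may or may
# not be published, so verify on the library page linked above.
subprocess.run(
    ['singularity', 'pull', 'alphafold.sif',
     'library://prehensilecode/alphafold_singularity/alphafold'],
    check=True)
```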