kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

Failed to allocate 50331648 bytes for new constant #55

Open Violet969 opened 1 year ago

Violet969 commented 1 year ago

Hi all, I have a problem when running 'run_alphafold.sh': it always fails with the following error.

2023-01-01 07:07:23.507834: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2130] Execution of replica 0 failed: INTERNAL: Failed to allocate 50331648 bytes for new constant
Traceback (most recent call last):
  File "train.py", line 264, in <module>
    app.run(main)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 216, in main
    state = jax.pmap(updater.init)(rng_pmap, data)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2158, in cache_miss
    out_tree, out_flat = f_pmapped_(*args, **kwargs)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/api.py", line 2034, in pmap_f
    out = pxla.xla_pmap(
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2022, in bind
    return map_bind(self, fun, *args, **params)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2054, in map_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 2025, in process
    return trace.process_map(self, fun, tracers, params)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/core.py", line 687, in process_call
    return primitive.impl(f, *tracers, **params)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 841, in xla_pmap_impl
    return compiled_fun(*args)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/_src/profiler.py", line 294, in wrapper
    return func(*args, **kwargs)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/jax/interpreters/pxla.py", line 1656, in __call__
    out_bufs = self.xla_executable.execute_sharded_on_local_devices(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

The stack trace below excludes JAX-internal frames. The preceding is the original exception that occurred, unmodified.


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "train.py", line 264, in <module>
    app.run(main)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/anaconda3/envs/alphafold_2/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "train.py", line 216, in main
    state = jax.pmap(updater.init)(rng_pmap, data)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to allocate 50331648 bytes for new constant: while running replica 0 and partition 0 of a replicated computation (other replicas may have failed as well).

I have 8 nodes of 12 GB GPUs and 125 GB of memory. Can anyone tell me how to solve this?

Old-Shatterhand commented 1 year ago

Hi @Violet969,

sorry for the late response. Can you please share the protein sequences you provide as an argument to AlphaFold?

What you report sounds like an out-of-memory problem. Please remember that even though you have multiple GPUs, only one will be used for execution, as AlphaFold is not parallelized.

Best, Roman
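
For anyone hitting the same error, here is a minimal sketch of one thing to try. It is not a confirmed fix for this issue. It converts the failed allocation size to MiB and builds an environment that pins the run to a single GPU and enables unified memory. CUDA_VISIBLE_DEVICES, TF_FORCE_UNIFIED_MEMORY and XLA_PYTHON_CLIENT_MEM_FRACTION are existing environment variables (the last two are the ones AlphaFold's Docker launcher sets); the helper names bytes_to_mib and single_gpu_env are hypothetical, and whether these settings help on a 12 GB card is an assumption.

```python
import os


def bytes_to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


def single_gpu_env(gpu_index: int = 0, mem_fraction: str = "4.0", base_env=None) -> dict:
    """Hypothetical helper: copy the environment and restrict it to one GPU.

    The values mirror what AlphaFold's Docker launcher sets; treating them as
    a remedy for this particular error is an assumption.
    """
    env = dict(os.environ if base_env is None else base_env)
    # Expose only one GPU to the process.
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    # Let XLA fall back to host memory when GPU memory runs out.
    env["TF_FORCE_UNIFIED_MEMORY"] = "1"
    # Fraction of GPU memory the JAX client may claim (above 1.0 relies on unified memory).
    env["XLA_PYTHON_CLIENT_MEM_FRACTION"] = mem_fraction
    return env


if __name__ == "__main__":
    # The allocation reported in the error message above.
    print(f"Failed allocation: {bytes_to_mib(50331648):.0f} MiB")
    env = single_gpu_env(gpu_index=0)
    for key in ("CUDA_VISIBLE_DEVICES", "TF_FORCE_UNIFIED_MEMORY", "XLA_PYTHON_CLIENT_MEM_FRACTION"):
        print(f"{key}={env[key]}")
```

The returned dictionary can be passed as env= to subprocess.run when launching 'run_alphafold.sh' with your usual arguments. The failed allocation is only 48 MiB, which suggests the GPU was already nearly full when the request was made rather than that this one constant is too large.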