google-deepmind / alphafold

Open source code for AlphaFold.
Apache License 2.0

Is it possible to run AlphaFold only on RAM (without GPU)? Error: 'Out of memory while trying to allocate 1740687616 bytes' #417

Closed AlbaAlba1 closed 2 years ago

AlbaAlba1 commented 2 years ago

Hi all,

I have been stuck on the following issue for some time: 'Out of memory while trying to allocate 1740687616 bytes'. I am trying to run a prediction for a very big protein (~2.5K amino acids), and this error happens while running model 1 of the prediction. I previously managed to run AlphaFold for small proteins, and even a complex of 3 small proteins (~1350 aa in total), with no problems.

I am running with Docker on Linux Ubuntu 20.04, and my Nvidia drivers are fine (version 510.54, CUDA version 11.6). At first I thought the problem was with RAM, so I increased the swap file and enabled zram (200 GB in total) - it did not help. I then upgraded my RAM from 64 GB to 128 GB. Upgrading the RAM did not help either; I get the same out-of-memory error as before with 64 GB.

I also tried the trick of commenting out 2 lines in the run_docker.py script from issue #197 (https://github.com/deepmind/alphafold/issues/197), and it did partially help with memory: before, the error was 'Out of memory while trying to allocate 43711962272 bytes'; after changing the script it became the current 'Out of memory while trying to allocate 1740687616 bytes'. But the problem is still not solved.
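For anyone reproducing this: the memory-related settings that run_docker.py passes into the container look roughly like the snippet below (a sketch from my reading of the script; whether these are exactly the two lines issue #197 means is my assumption):

```python
# Memory-related env vars that run_docker.py sets for the container (sketch).
env_vars = {
    # Lets XLA back GPU allocations with CUDA unified memory, so buffers
    # can spill from GPU memory into host RAM.
    'TF_FORCE_UNIFIED_MEMORY': '1',
    # Allows JAX to reserve up to 4x the physical GPU memory when unified
    # memory is enabled.
    'XLA_PYTHON_CLIENT_MEM_FRACTION': '4.0',
}
```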

What I suspect is that my graphics card is the bottleneck, because it has only 4 GB of memory. In the log file I see: 'W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.62GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available'

I cannot buy a new graphics card, so maybe it is pretty naive to ask, but is it possible to run everything on RAM alone, without the GPU?
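To show what I mean, a CPU-only attempt might look like this (a sketch; I am assuming the --use_gpu flag of run_docker.py and the standard JAX env var here):

```python
# Force JAX onto the CPU backend; must be set before jax is first imported.
import os
os.environ['JAX_PLATFORM_NAME'] = 'cpu'

import jax
print(jax.local_devices())  # expect something like [CpuDevice(id=0)]

# Or, from the shell, skip the GPU at the Docker level:
#   python3 docker/run_docker.py --fasta_paths=target.fasta \
#       --max_template_date=2022-04-01 --use_gpu=False
```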

I hope somebody can help, because I have been struggling with this for several weeks already.

Here is the log file:

I0401 12:45:07.109923 140456015222592 run_alphafold.py:198] Running model model_1 on TGME49_244470
I0401 12:45:17.425668 140456015222592 model.py:166] Running predict with shape(feat) = {'aatype': (4, 2595), 'residue_index': (4, 2595), 'seq_length': (4,), 'template_aatype': (4, 4, 2595), 'template_all_atom_masks': (4, 4, 2595, 37), 'template_all_atom_positions': (4, 4, 2595, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 2595), 'msa_mask': (4, 508, 2595), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 2595, 3), 'template_pseudo_beta_mask': (4, 4, 2595), 'atom14_atom_exists': (4, 2595, 14), 'residx_atom14_to_atom37': (4, 2595, 14), 'residx_atom37_to_atom14': (4, 2595, 37), 'atom37_atom_exists': (4, 2595, 37), 'extra_msa': (4, 5120, 2595), 'extra_msa_mask': (4, 5120, 2595), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 2595), 'true_msa': (4, 508, 2595), 'extra_has_deletion': (4, 5120, 2595), 'extra_deletion_value': (4, 5120, 2595), 'msa_feat': (4, 508, 2595, 49), 'target_feat': (4, 2595, 22)}
2022-04-01 12:46:08.727055: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:272] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.62GiB with freed_by_count=0. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory were available.
Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 445, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 429, in main
    is_prokaryote=is_prokaryote)
  File "/app/alphafold/run_alphafold.py", line 207, in predict_structure
    random_seed=model_random_seed)
  File "/app/alphafold/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "/opt/conda/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/jax/_src/api.py", line 427, in cache_miss
    donated_invars=donated_invars, inline=inline)
  File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 1560, in bind
    return call_bind(self, fun, *args, **params)
  File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 1551, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 1563, in process
    return trace.process_call(self, fun, tracers, params)
  File "/opt/conda/lib/python3.7/site-packages/jax/core.py", line 606, in process_call
    return primitive.impl(f, *tracers, **params)
  File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 593, in _xla_call_impl
    *unsafe_map(arg_spec, args))
  File "/opt/conda/lib/python3.7/site-packages/jax/linear_util.py", line 262, in memoized_fun
    ans = call(fun, *args)
  File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 743, in _xla_callable
    compiled = backend_compile(backend, built, options)
  File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 360, in backend_compile
    return backend.compile(built_c, compile_options=options)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 1740687616 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 445, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 429, in main
    is_prokaryote=is_prokaryote)
  File "/app/alphafold/run_alphafold.py", line 207, in predict_structure
    random_seed=model_random_seed)
  File "/app/alphafold/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "/opt/conda/lib/python3.7/site-packages/jax/interpreters/xla.py", line 360, in backend_compile
    return backend.compile(built_c, compile_options=options)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 1740687616 bytes.

AnyaP commented 2 years ago

Hi,

Thanks for your question. This does indeed look like a GPU OOM error, and for a protein of this size (~2,500 amino acids) it is not uncommon. Unfortunately, running the inference on the CPU (with RAM) would not be a good option either, as it would take many hours, if not days.

You could try folding the protein in chunks by manually splitting it and then re-assembling the results (see the sketch below), but this approach involves manual intervention and has not been validated for quality.
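To make the chunking idea concrete, here is a minimal sketch of the splitting step (my own hypothetical helper, not part of AlphaFold; chunk and overlap sizes are illustrative):

```python
def split_sequence(seq: str, chunk_len: int = 1400, overlap: int = 200):
    """Yield (start, subsequence) windows covering seq, with overlapping
    ends so the per-chunk models can later be superposed onto each other."""
    step = chunk_len - overlap
    start = 0
    while True:
        yield start, seq[start:start + chunk_len]
        if start + chunk_len >= len(seq):
            break
        start += step

# e.g. a 2595-residue protein -> windows (0, 0..1400) and (1200, 1200..2595)
```

Each window would then be folded independently, with the overlap regions used to stitch the models back together.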

Thus, I'm afraid there are no good solutions, other than getting access to a GPU with more memory.

lbw124765283 commented 2 years ago

Dear friends, I have the same problem: RuntimeError: Resource exhausted: Out of memory while trying to allocate 85833996152 bytes. I am trying to run a prediction for a very big protein (~1.3K amino acids with 3 polymers), and this error happens while running model 1 of the prediction. I have 512 GB of memory and 4 A100 GPUs, but I still cannot solve this problem. Is AlphaFold unable to use that many GPUs? I wonder whether I should add GPUs, or whether there is another way to solve this problem.
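From what I can tell, each model runs on a single GPU, so extra A100s do not pool their memory for one prediction. What I am experimenting with instead is letting one GPU spill into the 512 GB of host RAM via unified memory (a sketch; the values are guesses on my part):

```python
import os
# Back GPU allocations with CUDA unified memory so oversized buffers can
# spill from the A100's HBM into host RAM.
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'
# Let JAX reserve a multiple of the physical GPU memory (illustrative value).
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '8.0'
# Pin the job to a single GPU, since the others would sit idle anyway.
os.environ['CUDA_VISIBLE_DEVICES'] = '0'
```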

YiningWang2 commented 2 years ago

Hello, I am also trying this way to predict a long protein sequence (more than 2000 AAs). Is there any way to re-assemble the chunk results?
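If nobody has a ready-made tool, my plan is to superpose consecutive chunks on their shared overlap residues with a Kabsch fit, something like this sketch (numpy only; all names are my own, and the quality of the seams is unvalidated):

```python
import numpy as np

def kabsch_align(P, Q):
    """Return R, t such that P @ R.T + t least-squares-matches Q.
    P, Q: (N, 3) arrays of corresponding atoms (e.g. CA atoms of the
    overlap region shared by two adjacent chunks)."""
    Pm, Qm = P.mean(axis=0), Q.mean(axis=0)
    H = (P - Pm).T @ (Q - Qm)               # 3x3 covariance of centered coords
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = Qm - Pm @ R.T
    return R, t

# Merge chunk B onto chunk A: fit B's copy of the overlap onto A's copy,
# transform all of B's atoms, then keep a single copy of the overlap.
# R, t = kabsch_align(overlap_coords_B, overlap_coords_A)
# coords_B_aligned = coords_B @ R.T + t
```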