kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

alphafold only uses one gpu #10

Closed: monororo closed this issue 2 years ago

monororo commented 2 years ago

I followed the instructions and successfully installed AlphaFold on a cluster. It partially works, but only one GPU gets used.

I added some debugging code to the scripts. The logs show that TensorFlow does discover both GPUs, but nvidia-smi reveals that all data and computation sit on GPU 0 while GPU 1 stays idle.
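(A minimal sketch of the kind of device check that produces the "visible gpus [...]" lines in the log below; the exact code added to the scripts is not shown in this issue, so the print labels here are illustrative.)

# Sketch only: list the GPUs that TensorFlow and JAX can see.
import jax
import tensorflow as tf

# Devices TensorFlow can enumerate; prints e.g.
# visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), ...]
print('visible gpus', tf.config.list_physical_devices('GPU'))

# Devices the JAX/XLA backend (which runs the actual model) can use.
print('jax devices', jax.devices())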

Here is the log:

$HOME/.local/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  warnings.warn(
I0813 13:45:04.035152 140042046441280 templates.py:837] Using precomputed obsolete pdbs $DATA/pdb_mmcif/obsolete.dat.
I0813 13:45:05.206957 140042046441280 tpu_client.py:54] Starting the local TPU driver.
I0813 13:45:05.239395 140042046441280 xla_bridge.py:214] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0813 13:45:05.629722 140042046441280 xla_bridge.py:214] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0813 13:45:14.966193 140042046441280 run_alphafold.ano.py:284] Have 5 models: ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
I0813 13:45:14.966413 140042046441280 run_alphafold.ano.py:297] Using random seed 8606097073378666681 for the data pipeline
I0813 13:45:15.419880 140042046441280 run_alphafold.ano.py:155] Running model model_1
2021-08-13 13:46:07.772502: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 6942677504 exceeds 10% of free system memory.
I0813 13:46:09.743540 140042046441280 model.py:145] Running predict with shape(feat) = {'aatype': (32, 2179), 'residue_index': (32, 2179), 'seq_length': (32,), 'template_aatype': (32, 4, 2179), 'template_all_atom_masks': (32, 4, 2179, 37), 'template_all_atom_positions': (32, 4, 2179, 37, 3), 'template_sum_probs': (32, 4, 1), 'is_distillation': (32,), 'seq_mask': (32, 2179), 'msa_mask': (32, 508, 2179), 'msa_row_mask': (32, 508), 'random_crop_to_size_seed': (32, 2), 'template_mask': (32, 4), 'template_pseudo_beta': (32, 4, 2179, 3), 'template_pseudo_beta_mask': (32, 4, 2179), 'atom14_atom_exists': (32, 2179, 14), 'residx_atom14_to_atom37': (32, 2179, 14), 'residx_atom37_to_atom14': (32, 2179, 37), 'atom37_atom_exists': (32, 2179, 37), 'extra_msa': (32, 5120, 2179), 'extra_msa_mask': (32, 5120, 2179), 'extra_msa_row_mask': (32, 5120), 'bert_mask': (32, 508, 2179), 'true_msa': (32, 508, 2179), 'extra_has_deletion': (32, 5120, 2179), 'extra_deletion_value': (32, 5120, 2179), 'msa_feat': (32, 508, 2179, 49), 'target_feat': (32, 2179, 22)}
2021-08-13 13:49:42.988439: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 39.13GiB (rounded to 42012920064)requested by op 
2021-08-13 13:49:42.991276: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:468] *******************************************************_____________________________________________
2021-08-13 13:49:42.991431: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2040] Execution of replica 0 failed: Resource exhausted: Out of memory while trying to allocate 42012919928 bytes.
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
running process_features
2021-08-13 13:45:15 running: process_features
Traceback (most recent call last):
  File "run_alphafold.ano.py", line 328, in <module>
    app.run(main)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_alphafold.ano.py", line 301, in main
    predict_structure(
  File "run_alphafold.ano.py", line 162, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict)
  File "$HOME/alphafold/alphafold-2.0/alphafold/model/model.py", line 147, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/_src/api.py", line 399, in cache_miss
    out_flat = xla.xla_call(
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1561, in bind
    return call_bind(self, fun, *args, **params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1552, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1564, in process
    return trace.process_call(self, fun, tracers, params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 607, in process_call
    return primitive.impl(f, *tracers, **params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 610, in _xla_call_impl
    return compiled_fun(*args)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 898, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 42012919928 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_alphafold.ano.py", line 328, in <module>
    app.run(main)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_alphafold.ano.py", line 301, in main
    predict_structure(
  File "run_alphafold.ano.py", line 162, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict)
  File "$HOME/alphafold/alphafold-2.0/alphafold/model/model.py", line 147, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 898, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 42012919928 bytes.

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                  Off |
| N/A   39C    P0    67W / 300W |  29754MiB / 32510MiB |     62%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                  Off |
| N/A   39C    P0    55W / 300W |    496MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     52643      C   python                          29749MiB |
|    1   N/A  N/A     52643      C   python                            491MiB |
+-----------------------------------------------------------------------------+
sanjaysrikakulam commented 2 years ago

Hi @monororo

Our bash script is only a wrapper around AF2, and AF2 itself may or may not run on multiple GPUs. The wrapper appears to be working fine here: it makes both GPUs available to AF2, and it is up to AF2 whether to use them.
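For context, here is a minimal sketch of the GPU-related environment a wrapper like this typically sets before AF2/JAX initialize (the values are illustrative, not copied from our script). The unified-memory settings are also what normally allow a large allocation, like the 39 GiB one in your log, to spill into host RAM instead of failing:

# Sketch only: GPU environment typically configured before AlphaFold/JAX start.
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'            # expose both GPUs to the process
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'           # let XLA allocations overflow into host RAM
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '4.0'  # let JAX address up to 4x one GPU's memory

# These must be set before jax is first imported; afterwards both devices are
# visible, but a single model_runner.predict call still executes on one device.
import jax
print(jax.devices())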

Here are some related threads I found in AF2's GitHub repo:

https://github.com/deepmind/alphafold/issues/66
https://github.com/deepmind/alphafold/issues/30
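If the goal is simply to keep both GPUs busy, one common approach (not something our wrapper does for you) is to run each of the five models in its own process and pin each process to one GPU with CUDA_VISIBLE_DEVICES. A rough sketch, with flag names that are placeholders for your actual invocation:

# Hypothetical sketch: one process per model, each pinned to a single GPU.
import os
import subprocess

models = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
gpu_ids = ['0', '1']

procs = []
for i, model in enumerate(models):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids[i % len(gpu_ids)])
    # '--model_names' and any other flags are placeholders; use your run script's real options.
    cmd = ['python', 'run_alphafold.py', f'--model_names={model}']
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()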

Hope this helps.