kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

alphafold only uses one gpu #10

Closed: monororo closed this issue 2 years ago

monororo commented 2 years ago

I followed the instructions and successfully installed AlphaFold on a cluster. It partially works, but only one GPU gets used.

I added some debugging code to the scripts. The logs show that TensorFlow does discover both GPUs, but nvidia-smi reveals that all data and computation sit on GPU 0 while GPU 1 stays idle.
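(A minimal sketch of the kind of device check that produces the "visible gpus [...]" lines in the log below; the exact code added to the scripts is not shown in this issue, so the print labels here are illustrative.)

# Sketch only: list the GPUs that TensorFlow and JAX can see.
import jax
import tensorflow as tf

# Devices TensorFlow can enumerate; prints e.g.
# visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), ...]
print('visible gpus', tf.config.list_physical_devices('GPU'))

# Devices the JAX/XLA backend (which runs the actual model) can use.
print('jax devices', jax.devices())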

Here is the log:

$HOME/.local/lib/python3.8/site-packages/absl/flags/_validators.py:203: UserWarning: Flag --preset has a non-None default value; therefore, mark_flag_as_required will pass even if flag is not specified in the command line!
  warnings.warn(
I0813 13:45:04.035152 140042046441280 templates.py:837] Using precomputed obsolete pdbs $DATA/pdb_mmcif/obsolete.dat.
I0813 13:45:05.206957 140042046441280 tpu_client.py:54] Starting the local TPU driver.
I0813 13:45:05.239395 140042046441280 xla_bridge.py:214] Unable to initialize backend 'tpu_driver': Not found: Unable to find driver in registry given worker: local://
I0813 13:45:05.629722 140042046441280 xla_bridge.py:214] Unable to initialize backend 'tpu': Invalid argument: TpuPlatform is not available.
I0813 13:45:14.966193 140042046441280 run_alphafold.ano.py:284] Have 5 models: ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
I0813 13:45:14.966413 140042046441280 run_alphafold.ano.py:297] Using random seed 8606097073378666681 for the data pipeline
I0813 13:45:15.419880 140042046441280 run_alphafold.ano.py:155] Running model model_1
2021-08-13 13:46:07.772502: W tensorflow/core/framework/cpu_allocator_impl.cc:80] Allocation of 6942677504 exceeds 10% of free system memory.
I0813 13:46:09.743540 140042046441280 model.py:145] Running predict with shape(feat) = {'aatype': (32, 2179), 'residue_index': (32, 2179), 'seq_length': (32,), 'template_aatype': (32, 4, 2179), 'template_all_atom_masks': (32, 4, 2179, 37), 'template_all_atom_positions': (32, 4, 2179, 37, 3), 'template_sum_probs': (32, 4, 1), 'is_distillation': (32,), 'seq_mask': (32, 2179), 'msa_mask': (32, 508, 2179), 'msa_row_mask': (32, 508), 'random_crop_to_size_seed': (32, 2), 'template_mask': (32, 4), 'template_pseudo_beta': (32, 4, 2179, 3), 'template_pseudo_beta_mask': (32, 4, 2179), 'atom14_atom_exists': (32, 2179, 14), 'residx_atom14_to_atom37': (32, 2179, 14), 'residx_atom37_to_atom14': (32, 2179, 37), 'atom37_atom_exists': (32, 2179, 37), 'extra_msa': (32, 5120, 2179), 'extra_msa_mask': (32, 5120, 2179), 'extra_msa_row_mask': (32, 5120), 'bert_mask': (32, 508, 2179), 'true_msa': (32, 508, 2179), 'extra_has_deletion': (32, 5120, 2179), 'extra_deletion_value': (32, 5120, 2179), 'msa_feat': (32, 508, 2179, 49), 'target_feat': (32, 2179, 22)}
2021-08-13 13:49:42.988439: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:457] Allocator (GPU_0_bfc) ran out of memory trying to allocate 39.13GiB (rounded to 42012920064)requested by op 
2021-08-13 13:49:42.991276: W external/org_tensorflow/tensorflow/core/common_runtime/bfc_allocator.cc:468] *******************************************************_____________________________________________
2021-08-13 13:49:42.991431: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2040] Execution of replica 0 failed: Resource exhausted: Out of memory while trying to allocate 42012919928 bytes.
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
visible gpus [PhysicalDevice(name='/physical_device:GPU:0', device_type='GPU'), PhysicalDevice(name='/physical_device:GPU:1', device_type='GPU')]
running process_features
2021-08-13 13:45:15 running: process_features
Traceback (most recent call last):
  File "run_alphafold.ano.py", line 328, in <module>
    app.run(main)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_alphafold.ano.py", line 301, in main
    predict_structure(
  File "run_alphafold.ano.py", line 162, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict)
  File "$HOME/alphafold/alphafold-2.0/alphafold/model/model.py", line 147, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 183, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/_src/api.py", line 399, in cache_miss
    out_flat = xla.xla_call(
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1561, in bind
    return call_bind(self, fun, *args, **params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1552, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1564, in process
    return trace.process_call(self, fun, tracers, params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 607, in process_call
    return primitive.impl(f, *tracers, **params)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 610, in _xla_call_impl
    return compiled_fun(*args)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 898, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Resource exhausted: Out of memory while trying to allocate 42012919928 bytes.

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "run_alphafold.ano.py", line 328, in <module>
    app.run(main)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "$HOME/.local/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_alphafold.ano.py", line 301, in main
    predict_structure(
  File "run_alphafold.ano.py", line 162, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict)
  File "$HOME/alphafold/alphafold-2.0/alphafold/model/model.py", line 147, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
  File "$HOME/.conda/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 898, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
RuntimeError: Resource exhausted: Out of memory while trying to allocate 42012919928 bytes.

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 460.32.03    Driver Version: 460.32.03    CUDA Version: 11.2     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  On   | 00000000:00:0A.0 Off |                  Off |
| N/A   39C    P0    67W / 300W |  29754MiB / 32510MiB |     62%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  Tesla V100-SXM2...  On   | 00000000:00:0B.0 Off |                  Off |
| N/A   39C    P0    55W / 300W |    496MiB / 32510MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A     52643      C   python                          29749MiB |
|    1   N/A  N/A     52643      C   python                            491MiB |
+-----------------------------------------------------------------------------+
sanjaysrikakulam commented 2 years ago

Hi @monororo

Our bash script is only a wrapper around AF2, and AF2 itself may or may not run on multiple GPUs. The wrapper appears to be working fine here: it makes both GPUs available to AF2, and it is up to AF2 whether to use them.
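For context, here is a minimal sketch of the GPU-related environment a wrapper like this typically sets before AF2/JAX initialize (the values are illustrative, not copied from our script). The unified-memory settings are also what normally allow a large allocation, like the 39 GiB one in your log, to spill into host RAM instead of failing:

# Sketch only: GPU environment typically configured before AlphaFold/JAX start.
import os

os.environ['CUDA_VISIBLE_DEVICES'] = '0,1'            # expose both GPUs to the process
os.environ['TF_FORCE_UNIFIED_MEMORY'] = '1'           # let XLA allocations overflow into host RAM
os.environ['XLA_PYTHON_CLIENT_MEM_FRACTION'] = '4.0'  # let JAX address up to 4x one GPU's memory

# These must be set before jax is first imported; afterwards both devices are
# visible, but a single model_runner.predict call still executes on one device.
import jax
print(jax.devices())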

Here are some related threads I found in AF2's GitHub repo:

https://github.com/deepmind/alphafold/issues/66
https://github.com/deepmind/alphafold/issues/30
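If the goal is simply to keep both GPUs busy, one common approach (not something our wrapper does for you) is to run each of the five models in its own process and pin each process to one GPU with CUDA_VISIBLE_DEVICES. A rough sketch, with flag names that are placeholders for your actual invocation:

# Hypothetical sketch: one process per model, each pinned to a single GPU.
import os
import subprocess

models = ['model_1', 'model_2', 'model_3', 'model_4', 'model_5']
gpu_ids = ['0', '1']

procs = []
for i, model in enumerate(models):
    env = dict(os.environ, CUDA_VISIBLE_DEVICES=gpu_ids[i % len(gpu_ids)])
    # '--model_names' and any other flags are placeholders; use your run script's real options.
    cmd = ['python', 'run_alphafold.py', f'--model_names={model}']
    procs.append(subprocess.Popen(cmd, env=env))

for p in procs:
    p.wait()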

Hope this helps.