kalininalab / alphafold_non_docker

AlphaFold2 non-docker setup

Execution of replica 0 failed: Internal: CUBLAS_STATUS_EXECUTION_FAILED #19

Closed · avilella closed this 1 year ago

avilella commented 2 years ago

This could be unrelated to this repo and instead be just some sort of driver issue, but I'll post the error in case someone can help.

We've installed this repo on an Ubuntu 21.04 laptop with a Thunderbolt eGPU enclosure holding 2 Nvidia Quadro P1000 cards.

We kick off two parallel jobs, one on GPU 0 and the other on GPU 1. They mostly run fine, but after a few minutes or hours one of the jobs sometimes gets stuck with the error below:

Any ideas welcomed, thanks.
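For context, each job is pinned to one card with CUDA_VISIBLE_DEVICES, roughly like the sketch below (the commands and flags are placeholders, not the exact invocations we use):

import os
import subprocess

def launch(cmd, gpu_id):
    """Start one AlphaFold run with only a single card visible to it."""
    env = os.environ.copy()
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_id)  # expose only this GPU to the job
    return subprocess.Popen(cmd, env=env)

# Placeholder commands; the real runs use this repo's run_alphafold.sh with its full set of flags.
job0 = launch(["bash", "run_alphafold.sh", "-f", "target_a.fasta"], gpu_id=0)
job1 = launch(["bash", "run_alphafold.sh", "-f", "target_b.fasta"], gpu_id=1)
for job in (job0, job1):
    job.wait()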

I0930 07:04:52.753646 140368369108800 model.py:131] Running predict with shape(feat) = {'aatype': (4, 245), 'residue_index': (4, 245), 'seq_length': (4,), 'template_aatype': (4, 4, 245), 'template_all_atom_masks': (4, 4, 245, 37), 'template_all_atom_positions': (4, 4, 245, 37, 3), 'template_sum_probs': (4, 4, 1), 'is_distillation': (4,), 'seq_mask': (4, 245), 'msa_mask': (4, 508, 245), 'msa_row_mask': (4, 508), 'random_crop_to_size_seed': (4, 2), 'template_mask': (4, 4), 'template_pseudo_beta': (4, 4, 245, 3), 'template_pseudo_beta_mask': (4, 4, 245), 'atom14_atom_exists': (4, 245, 14), 'residx_atom14_to_atom37': (4, 245, 14), 'residx_atom37_to_atom14': (4, 245, 37), 'atom37_atom_exists': (4, 245, 37), 'extra_msa': (4, 5120, 245), 'extra_msa_mask': (4, 5120, 245), 'extra_msa_row_mask': (4, 5120), 'bert_mask': (4, 508, 245), 'true_msa': (4, 508, 245), 'extra_has_deletion': (4, 5120, 245), 'extra_deletion_value': (4, 5120, 245), 'msa_feat': (4, 508, 245, 49), 'target_feat': (4, 245, 22)}
2021-09-30 07:06:51.976947: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2040] Execution of replica 0 failed: Internal: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "/home/user/alphafold/run_alphafold.py", line 310, in <module>
    app.run(main)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/user/alphafold/run_alphafold.py", line 284, in main
    predict_structure(
  File "/home/user/alphafold/run_alphafold.py", line 149, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict)
  File "/home/user/alphafold/alphafold/model/model.py", line 133, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/_src/api.py", line 411, in cache_miss
    out_flat = xla.xla_call(
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1618, in bind
    return call_bind(self, fun, *args, **params)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1609, in call_bind
    outs = primitive.process(top_trace, fun, tracers, params)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 1621, in process
    return trace.process_call(self, fun, tracers, params)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/core.py", line 615, in process_call
    return primitive.impl(f, *tracers, **params)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 625, in _xla_call_impl
    out = compiled_fun(*args)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 960, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
jax._src.traceback_util.UnfilteredStackTrace: RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/user/alphafold/run_alphafold.py", line 310, in <module>
    app.run(main)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/home/user/alphafold/run_alphafold.py", line 284, in main
    predict_structure(
  File "/home/user/alphafold/run_alphafold.py", line 149, in predict_structure
    prediction_result = model_runner.predict(processed_feature_dict)
  File "/home/user/alphafold/alphafold/model/model.py", line 133, in predict
    result = self.apply(self.params, jax.random.PRNGKey(0), feat)
  File "/home/user/miniconda3/envs/alphafold/lib/python3.8/site-packages/jax/interpreters/xla.py", line 960, in _execute_compiled
    out_bufs = compiled.execute(input_bufs)
RuntimeError: Internal: CUBLAS_STATUS_EXECUTION_FAILED
2021-09-30 07:06:53.025348: E external/org_tensorflow/tensorflow/stream_executor/cuda/cuda_driver.cc:1039] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***

        PyDict_SetItem
        _PyModule_ClearDict
        PyImport_Cleanup
        Py_FinalizeEx
        Py_RunMain
        Py_BytesMain
        __libc_start_main

*** End stack trace ***

2021-09-30 07:06:53.025585: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:99] Check failed: pair.first->SynchronizeAllActivity() 
Fatal Python error: Aborted
sanjaysrikakulam commented 2 years ago

Hi @avilella

To troubleshoot, we might have to eliminate a few things from consideration:

1. Please check if the installed CUDA/cuDNN versions match the GPU.
2. Please reinstall the drivers.
3. Please check if the GPUs are broken/faulty (there may be tools for this, for example https://github.com/wilicc/gpu-burn; I have never used it, so I cannot comment on it, use it at your own risk).
4. Test the AF2 runs with small protein sequences and see if they succeed (the P1000s have 4 GB of memory, so we need to make sure this is not an out-of-memory issue).

At the moment these are the only things I can think of. Please check and let me know, and I will try to help troubleshoot as much as I can.
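For (1)-(3), a quick sanity check you could try from within the alphafold conda environment is a small matmul on each card; a minimal sketch that just exercises cuBLAS on every device JAX can see:

import jax
import jax.numpy as jnp

print(jax.devices())  # should list both P1000s as GPU devices

for dev in jax.devices():
    x = jax.device_put(jnp.ones((2048, 2048), dtype=jnp.float32), dev)
    y = jnp.dot(x, x).block_until_ready()  # runs the matmul (cuBLAS) on that card
    print(dev, float(y[0, 0]))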

avilella commented 2 years ago

Thanks, I'll check it out.

We can discard (4), as the jobs I am submitting never go higher than 3 GB.

As for (3), I am intrigued by the possibility of faulty GPUs: I'll swap the 2 cards for 2 slightly different cards and run on the same Ubuntu 21.04 with the same drivers, which should hopefully clarify whether (3) is the problem rather than (1) or (2).

Thanks for the detailed enumeration, I'll follow up with the results of the investigation in case it helps other people with the same problem.
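In the meantime, in case memory pressure on the 4 GB cards does turn out to matter, my understanding is that JAX's GPU allocator can be tuned with environment variables set before jax is imported; a rough sketch (the values are illustrative, not what we actually run with):

import os

# Option A: skip the big upfront allocation and grow GPU memory on demand.
os.environ["XLA_PYTHON_CLIENT_PREALLOCATE"] = "false"

# Option B (what the official AlphaFold Docker image sets, if I recall correctly):
# keep preallocation but let allocations spill into host RAM via unified memory.
# os.environ["TF_FORCE_UNIFIED_MEMORY"] = "1"
# os.environ["XLA_PYTHON_CLIENT_MEM_FRACTION"] = "4.0"

import jax  # import only after the environment variables are set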
