iamysk / AFsample2

Modelling protein conformational landscape with Alphafold
Apache License 2.0

Potential Internal Error #1

Open denizkavi opened 4 months ago

denizkavi commented 4 months ago

I got the following error while running with the same settings I've been using for multimers.

Randomizing MSAs
2024-06-10 05:24:29.724823: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: Failed to launch CUDA kernel: fusion_3634 with block dimensions: 32x1x1 and grid dimensions: 5392x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "AF_multitemplate/run_alphafold.py", line 522, in <module>
    app.run(main)
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "AF_multitemplate/run_alphafold.py", line 506, in main
    models_to_relax=FLAGS.models_to_relax)
  File "AF_multitemplate/run_alphafold.py", line 281, in predict_structure
    random_seed=model_random_seed)
  File "/home/ubuntu/AFsample2/AF_multitemplate/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/jax/_src/api.py", line 623, in cache_miss
    out_flat = call_bind_continuation(execute(*args_flat))
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
    out_flat = compiled.execute(in_flat)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to launch CUDA kernel: fusion_3634 with block dimensions: 32x1x1 and grid dimensions: 5392x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "AF_multitemplate/run_alphafold.py", line 522, in <module>
    app.run(main)
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "AF_multitemplate/run_alphafold.py", line 506, in main
    models_to_relax=FLAGS.models_to_relax)
  File "AF_multitemplate/run_alphafold.py", line 281, in predict_structure
    random_seed=model_random_seed)
  File "/home/ubuntu/AFsample2/AF_multitemplate/alphafold/model/model.py", line 167, in predict
    result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to launch CUDA kernel: fusion_3634 with block dimensions: 32x1x1 and grid dimensions: 5392x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-06-10 05:24:29.950905: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1043] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
        _PyGC_CollectNoFail
        PyImport_Cleanup
        Py_FinalizeEx

        _Py_UnixMain
        __libc_start_main

*** End stack trace ***

2024-06-10 05:24:29.951925: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:150] Check failed: pair.first->SynchronizeAllActivity() 
Fatal Python error: Aborted

Current thread 0x00007f847a54f740 (most recent call first):
Aborted (core dumped)

iamysk commented 4 months ago

This seems to be related to a JAX version mismatch. Could you list the packages in your environment, or confirm that the following versions are installed?

# Tested versions
jax==0.3.25
jaxlib==0.3.25+cuda11.cudnn82
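
One way to check this from inside the `afsample` environment is sketched below. It is only an illustration, not part of AFsample2: the helper names `installed_version` and `compare` are made up here, and the expected strings are copied from the tested versions listed above. The local version suffix reported for `jaxlib` may differ depending on how it was installed, so treat a mismatch on the suffix alone as a hint rather than a verdict.

```python
# Versions copied from the maintainer's "Tested versions" list above.
TESTED = {"jax": "0.3.25", "jaxlib": "0.3.25+cuda11.cudnn82"}


def installed_version(name):
    """Return the installed version string of a package, or None if absent."""
    try:
        from importlib import metadata  # available on Python 3.8+
    except ImportError:
        # Fallback for Python 3.7 (the interpreter in the traceback); needs setuptools.
        import pkg_resources
        try:
            return pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            return None
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def compare(tested, lookup=installed_version):
    """Map each package name to (installed, expected, matches)."""
    report = {}
    for name, expected in tested.items():
        found = lookup(name)
        report[name] = (found, expected, found == expected)
    return report


if __name__ == "__main__":
    for name, (found, expected, ok) in compare(TESTED).items():
        status = "OK" if ok else "MISMATCH"
        print(f"{name}: installed={found} expected={expected} [{status}]")
```

Running the script with the environment's own interpreter prints one line per package. Pasting that output into the thread, along with the full package list, should be enough to confirm or rule out the mismatch.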