I got the following error while running with the same settings I have been using for multimers.
Randomizing MSAs
2024-06-10 05:24:29.724823: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: Failed to launch CUDA kernel: fusion_3634 with block dimensions: 32x1x1 and grid dimensions: 5392x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
File "AF_multitemplate/run_alphafold.py", line 522, in <module>
app.run(main)
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "AF_multitemplate/run_alphafold.py", line 506, in main
models_to_relax=FLAGS.models_to_relax)
File "AF_multitemplate/run_alphafold.py", line 281, in predict_structure
random_seed=model_random_seed)
File "/home/ubuntu/AFsample2/AF_multitemplate/alphafold/model/model.py", line 167, in predict
result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
return fun(*args, **kwargs)
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/jax/_src/api.py", line 623, in cache_miss
out_flat = call_bind_continuation(execute(*args_flat))
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
out_flat = compiled.execute(in_flat)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to launch CUDA kernel: fusion_3634 with block dimensions: 32x1x1 and grid dimensions: 5392x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.
--------------------
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "AF_multitemplate/run_alphafold.py", line 522, in <module>
app.run(main)
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 312, in run
_run_main(main, args)
File "/root/miniconda3/envs/afsample/lib/python3.7/site-packages/absl/app.py", line 258, in _run_main
sys.exit(main(argv))
File "AF_multitemplate/run_alphafold.py", line 506, in main
models_to_relax=FLAGS.models_to_relax)
File "AF_multitemplate/run_alphafold.py", line 281, in predict_structure
random_seed=model_random_seed)
File "/home/ubuntu/AFsample2/AF_multitemplate/alphafold/model/model.py", line 167, in predict
result = self.apply(self.params, jax.random.PRNGKey(random_seed), feat)
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to launch CUDA kernel: fusion_3634 with block dimensions: 32x1x1 and grid dimensions: 5392x1x1: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2024-06-10 05:24:29.950905: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1043] could not synchronize on CUDA context: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered :: *** Begin stack trace ***
_PyGC_CollectNoFail
PyImport_Cleanup
Py_FinalizeEx
_Py_UnixMain
__libc_start_main
*** End stack trace ***
2024-06-10 05:24:29.951925: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:150] Check failed: pair.first->SynchronizeAllActivity()
Fatal Python error: Aborted
Current thread 0x00007f847a54f740 (most recent call first):
Aborted (core dumped)