Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

Hi, I'm running alphafold 2.3.2 through singularity

/usr/bin/time run_alphafold_singularity.py --fasta-paths protein.fasta --output-dir ./norelax_2.3.2 --cpus=$SLURM_CPUS_PER_TASK --gpu-devices cuda:${CUDA_VISIBLE_DEVICES} --model-preset multimer --data-dir /scratch/mirror/alphafold/2.3.2 --db-preset full_dbs

Someone had a similar issue here during the relaxation, but this does not seem to be the issue here since I'm running alphafold with out.

I'm getting the following error:

INFO:    Using cached SIF image
INFO:    Converting SIF file to temporary sandbox...
WARNING: underlay of /usr/bin/nvidia-smi required more than 50 (464) bind mounts
/bin/bash: /opt/conda/lib/libtinfo.so.6: no version information available (required by /bin/bash)
/sbin/ldconfig.real: Can't create temporary cache file /etc/ld.so.cache~: Read-only file system
I0913 09:08:19.261979 22564441192256 templates.py:857] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0913 09:08:19.787762 22564441192256 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0913 09:08:20.687740 22564441192256 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Interpreter CUDA Host
I0913 09:08:20.688797 22564441192256 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0913 09:08:20.689194 22564441192256 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
I0913 09:08:25.525992 22564441192256 run_alphafold.py:386] Have 25 models: ['model_1_multimer_v3_pred_0', 'model_1_multimer_v3_pred_1', 'model_1_multimer_v3_pred_2', 'model_1_multimer_v3_pred_3', 'model_1_multimer_v3_pred_4', 'model_2_multimer_v3_pred_0', 'model_2_multimer_v3_pred_1', 'model_2_multimer_v3_pred_2', 'model_2_multimer_v3_pred_3', 'model_2_multimer_v3_pred_4', 'model_3_multimer_v3_pred_0', 'model_3_multimer_v3_pred_1', 'model_3_multimer_v3_pred_2', 'model_3_multimer_v3_pred_3', 'model_3_multimer_v3_pred_4', 'model_4_multimer_v3_pred_0', 'model_4_multimer_v3_pred_1', 'model_4_multimer_v3_pred_2', 'model_4_multimer_v3_pred_3', 'model_4_multimer_v3_pred_4', 'model_5_multimer_v3_pred_0', 'model_5_multimer_v3_pred_1', 'model_5_multimer_v3_pred_2', 'model_5_multimer_v3_pred_3', 'model_5_multimer_v3_pred_4']
I0913 09:08:25.526679 22564441192256 run_alphafold.py:403] Using random seed 146838872409730248 for the data pipeline
I0913 09:08:25.527175 22564441192256 run_alphafold.py:161] Predicting protein
I0913 09:08:25.533941 22564441192256 pipeline_multimer.py:210] Running monomer pipeline on chain A: GNAMCCOK_01379
I0913 09:08:25.534407 22564441192256 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/slurm-8158753/tmpygxoxuhr/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 2 -N 1 /tmp/slurm-8158753/tmpw3mv7qt_.fasta /mnt/uniref90_database_path/uniref90.fasta"
I0913 09:08:25.554061 22564441192256 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0913 09:59:21.117203 22564441192256 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 3055.562 seconds
I0913 09:59:29.292488 22564441192256 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/slurm-8158753/tmpu2fmoupr/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 2 -N 1 /tmp/slurm-8158753/tmpw3mv7qt_.fasta /mnt/mgnify_database_path/mgy_clusters_2022_05.fa"
I0913 09:59:29.316004 22564441192256 utils.py:36] Started Jackhmmer (mgy_clusters_2022_05.fa) query
I0913 11:37:09.206206 22564441192256 utils.py:40] Finished Jackhmmer (mgy_clusters_2022_05.fa) query in 5859.889 seconds
I0913 11:37:49.755560 22564441192256 hmmbuild.py:121] Launching subprocess ['/usr/bin/hmmbuild', '--hand', '--amino', '/tmp/slurm-8158753/tmplzzql83_/output.hmm', '/tmp/slurm-8158753/tmplzzql83_/query.msa']
I0913 11:37:49.786695 22564441192256 utils.py:36] Started hmmbuild query
I0913 11:37:52.996394 22564441192256 hmmbuild.py:128] hmmbuild stdout:
# hmmbuild :: profile HMM construction from multiple sequence alignments
# HMMER 3.3 (Nov 2019); http://hmmer.org/
# Copyright (C) 2019 Howard Hughes Medical Institute.
# Freely distributed under the BSD open source license.
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
# input alignment file:             /tmp/slurm-8158753/tmplzzql83_/query.msa
# output HMM file:                  /tmp/slurm-8158753/tmplzzql83_/output.hmm
# input alignment is asserted as:   protein
# model architecture construction:  hand-specified by RF annotation
# - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -

# idx name                  nseq  alen  mlen eff_nseq re/pos description
#---- -------------------- ----- ----- ----- -------- ------ -----------
1     query                 9986  9298  1201     8.36  0.590

# CPU time: 2.95u 0.20s 00:00:03.15 Elapsed: 00:00:03.18

stderr:

I0913 11:37:52.997251 22564441192256 utils.py:40] Finished hmmbuild query in 3.210 seconds
I0913 11:37:53.011297 22564441192256 hmmsearch.py:103] Launching sub-process ['/usr/bin/hmmsearch', '--noali', '--cpu', '8', '--F1', '0.1', '--F2', '0.1', '--F3', '0.1', '--incE', '100', '-E', '100', '--domE', '100', '--incdomE', '100', '-A', '/tmp/slurm-8158753/tmpurn7irdg/output.sto', '/tmp/slurm-8158753/tmpurn7irdg/query.hmm', '/mnt/pdb_seqres_database_path/pdb_seqres.txt']
I0913 11:37:53.042010 22564441192256 utils.py:36] Started hmmsearch (pdb_seqres.txt) query
I0913 11:41:34.329584 22564441192256 utils.py:40] Finished hmmsearch (pdb_seqres.txt) query in 221.287 seconds
I0913 11:41:47.544001 22564441192256 hhblits.py:128] Launching subprocess "/usr/bin/hhblits -i /tmp/slurm-8158753/tmpw3mv7qt_.fasta -cpu 2 -oa3m /tmp/slurm-8158753/tmpkvpipjro/output.a3m -o /dev/null -n 3 -e 0.001 -maxseq 1000000 -realign_max 100000 -maxfilt 100000 -min_prefilter_hits 1000 -d /mnt/bfd_database_path/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt -d /mnt/uniref30_database_path/UniRef30_2021_03"
I0913 11:41:47.581761 22564441192256 utils.py:36] Started HHblits query
I0913 16:07:07.767602 22564441192256 utils.py:40] Finished HHblits query in 15920.172 seconds
I0913 16:07:08.922005 22564441192256 templates.py:940] Searching for template for: MVHINKMICIGFKSFRKKTVINFDKGFSAIVGANGSGKSNIIDAFVFVLGALSAKTLRATNIKDLISNGGNGLGPAQTASVEIVFDNKDGAFGLGESEIRILRKIDRKGNGIYRLNDKRSTRKEIVSLLDLAGIIPNSSNMIMQGELFRLINMNSTQRRELVEDIAGISSYNERKLSAEDELVKVQTNLGQISLLLNEVYIQLEQLKKEKEDAEKYLAVVEQEKIRNNALYQVKINSAVKNISDMALQKQDIEAQIGDINCLEQDLNEQISQLEMQIEDLNPKIQALQDEELLQMTYRMKELKNKITEFRTSLKYANKNLLTYEKEQKDLQLRLVDLQKQEESLKIEIIDIEKNKQTIQEKIDSKNAEISDFEDNLQKIDVEYTQIKEESKNIRIQINNVKEDKAEISTSIKVLENQISTMKNDKIKNEKKIFENHEHLGKMKVHLKELEHEEKVKLGLTDFDDVSKSGMEKKISQLNQENIHIQEKLKKIKPLATETQKSIFEIKSRIKVVKQMNSGNRALKAIKKLQNSGKITGIHGTIAELGSIDPKFAIAMEMAAGSRFNFVVVDNQEIGEQCINYLKQNKIGRASFIPLDEIRYSSFNLSISRDPKIYGRAVDLITFDQKYFHAFEYVFGRTIIVEDLPTARHLKVSAKRVTLDGDVIDGSNLMSGGQKNKPKGIGFKGTNDEETKVVDLEYNYNKFKNEIDALELKFKSNAGEISRLYQLKISGANKTKEINEKIAICKSNIQTLQASIKTLEVEIEDILISIKELECKLETLNASLSEVEQKLFSLNEKEHSIQEKLDSSEESVLKQQLRNAEKELKKLTKIASQIEIEYTKKMSTLTETISNGRKEAQQQLNVRSISISETNLSISTFESDLKKTEIEASALDEKIVQKSAVVANLLNNKKNLQFEVSEKKTSIGQLNNDRYPLKVKLNTFEIKSSELDMKIQEWKCHILPEILIPQEFLSLSESKHQLEIEKLLDQKNSLGAVNLRAIEKYSEIQARFTELEQKNEQVIQEREAILQFIEALESEKLKVFMNTFNAINANFGYIFSRLSPNGEARLELENLEDPFAGGVQIVARPGEKEKCNVMALSGGERTLTIIALILGIQMHVPSPYYILDEIDAALDDVNAALVADMIKELSEKSQFIIITHRDVTMARVDHLLGVSNIEGVTSVINLSIKKVLQELIKGETPLEESV
I0913 16:07:10.123861 22564441192256 templates.py:267] Found an exact template match 3zgx_A.
I0913 16:07:10.191119 22564441192256 templates.py:267] Found an exact template match 3zgx_A.
I0913 16:07:33.740917 22564441192256 templates.py:718] hit 7ogt_B did not pass prefilter: Proportion of residues aligned to query too small. Align ratio: 0.0974188176519567.
I0913 16:07:36.035511 22564441192256 templates.py:267] Found an exact template match 6wg3_B.
I0913 16:07:36.070983 22564441192256 templates.py:718] hit 6wg3_B did not pass prefilter: Proportion of residues aligned to query too small. Align ratio: 0.0890924229808493.
I0913 16:07:37.089503 22564441192256 templates.py:267] Found an exact template match 6wge_B.
I0913 16:07:37.117515 22564441192256 templates.py:718] hit 6wge_B did not pass prefilter: Proportion of residues aligned to query too small. Align ratio: 0.0890924229808493.
I0913 16:07:37.512340 22564441192256 templates.py:267] Found an exact template match 4i99_A.
I0913 16:07:40.114441 22564441192256 pipeline.py:234] Uniref90 MSA size: 10000 sequences.
I0913 16:07:40.114935 22564441192256 pipeline.py:235] BFD MSA size: 3695 sequences.
I0913 16:07:40.115195 22564441192256 pipeline.py:236] MGnify MSA size: 501 sequences.
I0913 16:07:40.115463 22564441192256 pipeline.py:237] Final (deduplicated) MSA size: 14143 sequences.
I0913 16:07:40.115876 22564441192256 pipeline.py:239] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0913 16:07:40.490063 22564441192256 run_alphafold.py:191] Running model model_1_multimer_v3_pred_0 on protein
I0913 16:07:40.492696 22564441192256 model.py:165] Running predict with shape(feat) = {'aatype': (2402,), 'residue_index': (2402,), 'seq_length': (), 'msa': (2048, 2402), 'num_alignments': (), 'template_aatype': (4, 2402), 'template_all_atom_mask': (4, 2402, 37), 'template_all_atom_positions': (4, 2402, 37, 3), 'asym_id': (2402,), 'sym_id': (2402,), 'entity_id': (2402,), 'deletion_matrix': (2048, 2402), 'deletion_mean': (2402,), 'all_atom_mask': (2402, 37), 'all_atom_positions': (2402, 37, 3), 'assembly_num_chains': (), 'entity_mask': (2402,), 'num_templates': (), 'cluster_bias_mask': (2048,), 'bert_mask': (2048, 2402), 'seq_mask': (2402,), 'msa_mask': (2048, 2402)}
I0914 07:06:31.314984 22564441192256 model.py:175] Output shape was {'distogram': {'bin_edges': (63,), 'logits': (2402, 2402, 64)}, 'experimentally_resolved': {'logits': (2402, 37)}, 'masked_msa': {'logits': (508, 2402, 22)}, 'num_recycles': (), 'predicted_aligned_error': (2402, 2402), 'predicted_lddt': {'logits': (2402, 50)}, 'structure_module': {'final_atom_mask': (2402, 37), 'final_atom_positions': (2402, 37, 3)}, 'plddt': (2402,), 'aligned_confidence_probs': (2402, 2402, 64), 'max_predicted_aligned_error': (), 'ptm': (), 'iptm': (), 'ranking_confidence': ()}
I0914 07:06:31.337612 22564441192256 run_alphafold.py:203] Total JAX model model_1_multimer_v3_pred_0 on protein predict time (includes compilation time, see --benchmark): 53930.8s
I0914 07:08:48.876144 22564441192256 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {<Residue 1200 (VAL) of chain 0>: ['OXT'], <Residue 2401 (VAL) of chain 1>: ['OXT']}, 'Se_in_MET': [], 'removed_chains': {0: []}}
I0914 07:08:55.371805 22564441192256 amber_minimize.py:407] Minimizing protein, attempt 1 of 100.
I0914 07:09:02.013652 22564441192256 amber_minimize.py:68] Restraining 19120 / 38856 particles.
I0914 07:29:15.688822 22564441192256 amber_minimize.py:177] alterations info: {'nonstandard_residues': [], 'removed_heterogens': set(), 'missing_residues': {}, 'missing_heavy_atoms': {}, 'missing_terminals': {}, 'Se_in_MET': [], 'removed_chains': {0: []}}
2023-09-14 07:29:54.584400: E external/org_tensorflow/tensorflow/compiler/xla/pjrt/pjrt_stream_executor_client.cc:2153] Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 432, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 408, in main
    predict_structure(
  File "/app/alphafold/run_alphafold.py", line 243, in predict_structure
    relaxed_pdb_str, _, violations = amber_relaxer.process(
  File "/app/alphafold/alphafold/relax/relax.py", line 62, in process
    out = amber_minimize.run_pipeline(
  File "/app/alphafold/alphafold/relax/amber_minimize.py", line 489, in run_pipeline
    ret.update(get_violation_metrics(prot))
  File "/app/alphafold/alphafold/relax/amber_minimize.py", line 357, in get_violation_metrics
    structural_violations, struct_metrics = find_violations(prot)
  File "/app/alphafold/alphafold/relax/amber_minimize.py", line 339, in find_violations
    violations = folding.find_structural_violations(
  File "/app/alphafold/alphafold/model/folding.py", line 761, in find_structural_violations
    between_residue_clashes = all_atom.between_residue_clash_loss(
  File "/app/alphafold/alphafold/model/all_atom.py", line 783, in between_residue_clash_loss
    dists = jnp.sqrt(1e-10 + jnp.sum(
  File "/opt/conda/lib/python3.8/site-packages/jax/_src/numpy/reductions.py", line 216, in sum
    return _reduce_sum(a, axis=_ensure_optional_axes(axis), dtype=dtype, out=out,
  File "/opt/conda/lib/python3.8/site-packages/jax/_src/traceback_util.py", line 162, in reraise_with_filtered_traceback
    return fun(*args, **kwargs)
  File "/opt/conda/lib/python3.8/site-packages/jax/_src/api.py", line 623, in cache_miss
    out_flat = call_bind_continuation(execute(*args_flat))
  File "/opt/conda/lib/python3.8/site-packages/jax/_src/dispatch.py", line 895, in _execute_compiled
    out_flat = compiled.execute(in_flat)
jax._src.traceback_util.UnfilteredStackTrace: jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered

The stack trace below excludes JAX-internal frames.
The preceding is the original exception that occurred, unmodified.

--------------------

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 432, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 408, in main
    predict_structure(
  File "/app/alphafold/run_alphafold.py", line 243, in predict_structure
    relaxed_pdb_str, _, violations = amber_relaxer.process(
  File "/app/alphafold/alphafold/relax/relax.py", line 62, in process
    out = amber_minimize.run_pipeline(
  File "/app/alphafold/alphafold/relax/amber_minimize.py", line 489, in run_pipeline
    ret.update(get_violation_metrics(prot))
  File "/app/alphafold/alphafold/relax/amber_minimize.py", line 357, in get_violation_metrics
    structural_violations, struct_metrics = find_violations(prot)
  File "/app/alphafold/alphafold/relax/amber_minimize.py", line 339, in find_violations
    violations = folding.find_structural_violations(
  File "/app/alphafold/alphafold/model/folding.py", line 761, in find_structural_violations
    between_residue_clashes = all_atom.between_residue_clash_loss(
  File "/app/alphafold/alphafold/model/all_atom.py", line 783, in between_residue_clash_loss
    dists = jnp.sqrt(1e-10 + jnp.sum(
  File "/opt/conda/lib/python3.8/site-packages/jax/_src/numpy/reductions.py", line 216, in sum
    return _reduce_sum(a, axis=_ensure_optional_axes(axis), dtype=dtype, out=out,
jaxlib.xla_extension.XlaRuntimeError: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.941169: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.941774: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:1032] could not wait stream on event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.942014: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:1049] Error waiting for event in stream: error recording waiting for CUDA event on stream 0x558713f90ad0; not marking stream as bad, as the Event object may be at fault. Monitor for further errors.
2023-09-14 07:29:56.942268: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:695] could not allocate CUDA stream for context 0x5587036232b0: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.942598: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2023-09-14 07:29:56.942749: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:614] unable to add host callback: CUDA_ERROR_INVALID_HANDLE: invalid resource handle
2023-09-14 07:29:56.942900: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_event.cc:29] Error polling for event status: failed to query event: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.943028: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:695] could not allocate CUDA stream for context 0x5587036232b0: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.943238: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization
2023-09-14 07:29:56.943573: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/cuda/cuda_driver.cc:695] could not allocate CUDA stream for context 0x5587036232b0: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered
2023-09-14 07:29:56.943724: E external/org_tensorflow/tensorflow/compiler/xla/stream_executor/stream.cc:297] failed to allocate stream during initialization

    _PyGC_CollectNoFail
    PyImport_Cleanup
    Py_FinalizeEx
    Py_RunMain
    Py_BytesMain
    __libc_start_main

*** End stack trace ***

2023-09-14 07:29:57.302438: F external/org_tensorflow/tensorflow/compiler/xla/service/gpu/gpu_executable.cc:150] Check failed: pair.first->SynchronizeAllActivity()
Fatal Python error: Aborted

Current thread 0x00001485b1807740 (most recent call first):
<no Python frame>
/app/run_alphafold.sh: line 3: 2843920 Aborted                 (core dumped) python /app/alphafold/run_alphafold.py "$@"
INFO:    Cleaning up image...
Traceback (most recent call last):
  File "/home/apps/alphafold/2.3.2/bin/run_alphafold_singularity.py", line 280, in <module>
    main()
  File "/home/apps/alphafold/2.3.2/bin/run_alphafold_singularity.py", line 148, in main
    p.check_returncode()
  File "/usr/lib64/python3.6/subprocess.py", line 389, in check_returncode
    self.stderr)
subprocess.CalledProcessError: Command '['singularity', 'exec', '--nv', '--bind', '/srv/scratch/user/alphafold_ex:/mnt/fasta_path_0:ro,/srv/scratch/mirror/alphafold/2.3.2/uniref90:/mnt/uniref90_database_path:ro,/srv/scratch/mirror/alphafold/2.3.2/mgnify:/mnt/mgnify_database_path:ro,/srv/scratch/mirror/alphafold:/mnt/data_dir:ro,/srv/scratch/mirror/alphafold/2.3.2/pdb_mmcif:/mnt/template_mmcif_dir:ro,/srv/scratch/mirror/alphafold/2.3.2/pdb_mmcif:/mnt/obsolete_pdbs_path:ro,/srv/scratch/mirror/alphafold/2.3.2/uniprot:/mnt/uniprot_database_path:ro,/srv/scratch/mirror/alphafold/2.3.2/pdb_seqres:/mnt/pdb_seqres_database_path:ro,/srv/scratch/mirror/alphafold/2.3.2/uniref30:/mnt/uniref30_database_path:ro,/srv/scratch/mirror/alphafold/2.3.2/bfd:/mnt/bfd_database_path:ro,/srv/scratch/user/alphafold_ex:/mnt/output:rw', '--env="NVIDIA_VISIBLE_DEVICES=cuda:3"', '--env="TF_FORCE_UNIFIED_MEMORY=1"', '--env="XLA_PYTHON_CLIENT_MEM_FRACTION=4.0"', '--env="OPENMM_CPU_THREADS=2"', '--env="MAX_CPUS=2"', '--env="LD_LIBRARY_PATH=/opt/conda/lib:/usr/local/cuda-11.2/targets/x86_64-linux/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/.singularity.d/libs"', 'docker://catgumag/alphafold:2.3.0', '/app/run_alphafold.sh', '--fasta_paths=/mnt/fasta_path_0/protein.fasta', '--uniref90_database_path=/mnt/uniref90_database_path/uniref90.fasta', '--mgnify_database_path=/mnt/mgnify_database_path/mgy_clusters_2022_05.fa', '--data_dir=/mnt/data_dir/2.3.2', '--template_mmcif_dir=/mnt/template_mmcif_dir/mmcif_files', '--obsolete_pdbs_path=/mnt/obsolete_pdbs_path/obsolete.dat', '--uniprot_database_path=/mnt/uniprot_database_path/uniprot.fasta', '--pdb_seqres_database_path=/mnt/pdb_seqres_database_path/pdb_seqres.txt', '--uniref30_database_path=/mnt/uniref30_database_path/UniRef30_2021_03', '--bfd_database_path=/mnt/bfd_database_path/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt', '--output_dir=/mnt/output/norelax_2.3.2', '--max_template_date=2023-09-13', '--db_preset=full_dbs', '--model_preset=multimer', '--benchmark=False', '--use_precomputed_msas=False', '--num_multimer_predictions_per_model=5', '--run_relax=True', '--use_gpu_relax=True', '--logtostderr']' returned non-zero exit status 134.
Command exited with non-zero status 1
60282.35user 33447.38system 22:22:51elapsed 116%CPU (0avgtext+0avgdata 80461712maxresident)k
608315346inputs+42660683outputs (1193402major+119210482minor)pagefaults 0swaps

google-deepmind / alphafold

Execution of replica 0 failed: INTERNAL: Failed to load in-memory CUBIN: CUDA_ERROR_ILLEGAL_ADDRESS: an illegal memory access was encountered #831