Open DrJesseHansen opened 1 year ago
your "python3 run_alphafold.py ..." command lacks several continuation slashes, presumably because you removed the paths? And you should show the actual error messages.
Yes, you are correct that the continuation backslashes were removed; they are in the original script. Thank you.
Here is the end of the log file:
I0217 01:25:02.775132 22633130694464 templates.py:267] Found an exact template match 1v9d_A.
I0217 01:25:02.792168 22633130694464 pipeline.py:234] Uniref90 MSA size: 37 sequences.
I0217 01:25:02.793349 22633130694464 pipeline.py:235] BFD MSA size: 24 sequences.
I0217 01:25:02.794069 22633130694464 pipeline.py:236] MGnify MSA size: 2 sequences.
I0217 01:25:02.794656 22633130694464 pipeline.py:237] Final (deduplicated) MSA size: 60 sequences.
I0217 01:25:02.795226 22633130694464 pipeline.py:239] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0217 01:25:02.796629 22633130694464 jackhmmer.py:133] Launching subprocess "/mnt/nfs/clustersw/Debian/bullseye/hmmer/3.3.2/bin/jackhmmer -o /dev/null -A /tmp/tmpf00et6_e/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /tmp/tmpto0xd9_f.fasta /nfs/scistore14/rcsb/alphafold.databases/uniprot/uniprot.fasta"
I0217 01:25:02.855708 22633130694464 utils.py:36] Started Jackhmmer (uniprot.fasta) query
I0217 02:01:25.690917 22633130694464 utils.py:40] Finished Jackhmmer (uniprot.fasta) query in 2182.834 seconds
I0217 02:01:26.236864 22633130694464 run_alphafold.py:190] Running model model_1_multimer_v2_pred_0 on 4a_trimer_4b_dimer_23k_trimer_39p_trimer
I0217 02:01:26.238596 22633130694464 model.py:165] Running predict with shape(feat) = {'aatype': (4431,), 'residue_index': (4431,), 'seq_length': (), 'msa': (893, 4431), 'num_alignments': (), 'template_aatype': (4, 4431), 'template_all_atom_mask': (4, 4431, 37), 'template_all_atom_positions': (4, 4431, 37, 3), 'asym_id': (4431,), 'sym_id': (4431,), 'entity_id': (4431,), 'deletion_matrix': (893, 4431), 'deletion_mean': (4431,), 'all_atom_mask': (4431, 37), 'all_atom_positions': (4431, 37, 3), 'assembly_num_chains': (), 'entity_mask': (4431,), 'num_templates': (), 'cluster_bias_mask': (893,), 'bert_mask': (893, 4431), 'seq_mask': (4431,), 'msa_mask': (893, 4431)}
/nfs/scistore07/clustersw/debian/bullseye/cuda11.2/alphafold/2.2.4c/alphafold-2.2.4/alphafold/model/geometry/struct_of_array.py:136: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
flat_array_like, inner_treedef = jax.tree_flatten(array_like)
/nfs/scistore07/clustersw/debian/bullseye/cuda11.2/alphafold/2.2.4c/alphafold-2.2.4/alphafold/model/geometry/struct_of_array.py:209: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
value_dict[array_field] = jax.tree_unflatten(
2023-02-17 02:02:28.871775: E external/org_tensorflow/tensorflow/compiler/xla/shape_util.cc:311] INVALID_ARGUMENT: invalid shape type=11, dims=[-1781845888]
2023-02-17 02:02:28.879099: F external/org_tensorflow/tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INVALID_ARGUMENT: invalid shape type=11, dims=[-1781845888]
Fatal Python error: Aborted
Current thread 0x00001495afb71740 (most recent call first):
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 994 in backend_
compile
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/profiler.py", line 313 in wrapper
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 1054 in compile
_or_get_cached
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 1136 in from_xl
a_computation
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 978 in compile
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 342 in _xla_cal
lable_uncached
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/linear_util.py", line 309 in memoized_f
un
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 234 in _xla_cal
l_impl
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/core.py", line 701 in process_call
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/core.py", line 1955 in call_bind
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/core.py", line 1939 in bind
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/api.py", line 606 in cache_miss
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 162 in re
raise_with_filtered_traceback
File "/nfs/scistore07/clustersw/debian/bullseye/cuda11.2/alphafold/2.2.4c/alphafold-2.2.4/alphafold/model/model.py", line 167 in predict
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py", line 198 in predict_structure
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py", line 398 in main
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/absl/app.py", line 258 in _run_main
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/absl/app.py", line 312 in run
File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py", line 422 in <module>
/var/lib/slurm/slurmd/job2543698/slurm_script: line 39: 3334595 Aborted python3 /mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py --model_preset=multimer --fasta_paths=$MY_PROTEIN_PATH --output_dir=$(dirname $MY_PROTEIN_PATH) --data_dir=/nfs/scistore14/rcsb/alphafold.databases/ --mgnify_database_path=/nfs/scistore14/rcsb/alphafold.databases/mgnify/mgy_clusters_2018_12.fa --template_mmcif_dir=/nfs/scistore14/rcsb/alphafold.databases.v2/pdb_mmcif/mmcif_files/ --max_template_date=2020-05-14 --obsolete_pdbs_path=/nfs/scistore14/rcsb/alphafold.databases.v2/pdb_mmcif/obsolete.dat --use_gpu_relax=true --bfd_database_path=/nfs/scistore14/rcsb/alphafold.databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --uniclust30_database_path=/nfs/scistore14/rcsb/alphafold.databases/uniclust30/uniclust30_2018_08/uniclust30_2018_08 --uniref90_database_path=/nfs/scistore14/rcsb/alphafold.databases/uniref90/uniref90.fasta --pdb_seqres_database_path=/nfs/scistore14/rcsb/alphafold.databases/pdb_seqres/pdb_seqres.txt --uniprot_database_path=/nfs/scistore14/rcsb/alphafold.databases/uniprot/uniprot.fasta
Note that I've tried running this on various GPUs and none work. This includes A40s, which have 48 GB of VRAM.
Dear all,
Some of my AF2 jobs with larger sequences are failing, for what I can only assume are memory issues. I have discovered that it is possible to enable unified memory by adding the two flags below to my SLURM submission script; however, the issue persists. I have been working with our IT department, but we cannot resolve it. Is there any advice on what I might be doing wrong in my submission script, or other things I can check? Thank you.
flags to add for unified memory:
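These are presumably the two environment variables that the AlphaFold README describes for unified memory; a minimal sketch of how they are typically exported in a submission script (the exact values, particularly XLA_PYTHON_CLIENT_MEM_FRACTION, may differ from what was actually used):

```bash
# Let TensorFlow/XLA use unified (host + device) memory so large allocations can spill to system RAM
export TF_FORCE_UNIFIED_MEMORY=1
# Allow the XLA client to request up to 4x the GPU memory, as suggested in the AlphaFold README
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0
```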
my script (note I have removed actual paths from the script):
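A reconstruction sketch of the kind of submission script involved, based on the run_alphafold.py invocation visible in the log above, with the unified-memory exports included. The #SBATCH resource requests and the AF_DIR/DB_DIR paths are placeholders, not the actual values used:

```bash
#!/bin/bash
#SBATCH --job-name=af2_multimer   # placeholder resource requests, not the actual values used
#SBATCH --partition=gpu
#SBATCH --gres=gpu:1
#SBATCH --cpus-per-task=8
#SBATCH --mem=128G

# Unified-memory flags (see above)
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION=4.0

# Placeholders for the cluster-specific paths that were removed from the original script
MY_PROTEIN_PATH=/path/to/4a_trimer_4b_dimer_23k_trimer_39p_trimer.fasta
AF_DIR=/path/to/alphafold-2.2.4
DB_DIR=/path/to/alphafold.databases

python3 $AF_DIR/run_alphafold.py \
    --model_preset=multimer \
    --fasta_paths=$MY_PROTEIN_PATH \
    --output_dir=$(dirname $MY_PROTEIN_PATH) \
    --data_dir=$DB_DIR/ \
    --mgnify_database_path=$DB_DIR/mgnify/mgy_clusters_2018_12.fa \
    --template_mmcif_dir=$DB_DIR/pdb_mmcif/mmcif_files/ \
    --max_template_date=2020-05-14 \
    --obsolete_pdbs_path=$DB_DIR/pdb_mmcif/obsolete.dat \
    --use_gpu_relax=true \
    --bfd_database_path=$DB_DIR/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt \
    --uniclust30_database_path=$DB_DIR/uniclust30/uniclust30_2018_08/uniclust30_2018_08 \
    --uniref90_database_path=$DB_DIR/uniref90/uniref90.fasta \
    --pdb_seqres_database_path=$DB_DIR/pdb_seqres/pdb_seqres.txt \
    --uniprot_database_path=$DB_DIR/uniprot/uniprot.fasta
```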