google-deepmind / alphafold

Open source code for AlphaFold.
Apache License 2.0

Enabling Unified Memory on Compute Cluster #696

Open DrJesseHansen opened 1 year ago

DrJesseHansen commented 1 year ago

Dear all,

Some of my AF2 jobs with larger sequences are failing, for what I can only assume are memory issues. I have discovered that it is possible to enable unified memory by adding the two flags below to my SLURM submission script; however, the issue persists. I have been working with our IT department, but we cannot resolve it. Is there any advice on what I might be doing wrong in my submission script, or other things I can check? Thank you.

flags to add for unified memory:

export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"
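
My understanding of these flags (a paraphrase, not official documentation): TF_FORCE_UNIFIED_MEMORY=1 lets XLA spill GPU allocations into host RAM via CUDA unified memory, and XLA_PYTHON_CLIENT_MEM_FRACTION="4.0" allows JAX to address up to 4x the memory of a single GPU. A minimal sketch for checking that the variables actually reach the Python process inside the job (the script name is mine, not part of AlphaFold):

# check_env.py -- hypothetical helper, run inside the SLURM job.
import os
import jax

# Confirm the unified-memory variables were exported into this process.
for var in ("TF_FORCE_UNIFIED_MEMORY", "XLA_PYTHON_CLIENT_MEM_FRACTION"):
    print(var, "=", os.environ.get(var, "<not set>"))

# Should list GPU devices; if JAX falls back to CPU, the flags are moot.
print("JAX devices:", jax.devices())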

my script (note I have removed actual paths from the script):

#!/bin/bash
#
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=8
#SBATCH --mem=200GB
#
#SBATCH --time=36:00:00
#SBATCH --no-requeue
#
#SBATCH --partition=gpu
#SBATCH --gres=gpu:4

MY_PROTEIN_PATH=(my_fasta_location.fasta)

echo $HOSTNAME

module load alphafold/2.2.4c

export OPENMM_CUDA_COMPILER=$(which nvcc)
export TF_FORCE_UNIFIED_MEMORY=1
export XLA_PYTHON_CLIENT_MEM_FRACTION="4.0"

python3 run_alphafold.py \
    --model_preset=multimer \
    --fasta_paths=$MY_PROTEIN_PATH \
    --output_dir=$(dirname $MY_PROTEIN_PATH) \
    --data_dir= \
    --mgnify_database_path= \
    --template_mmcif_dir= \
    --max_template_date=2020-05-14 \
    --obsolete_pdbs_path= \
    --use_gpu_relax=true \
    --bfd_database_path= \
    --uniclust30_database_path= \
    --uniref90_database_path= \
    --pdb_seqres_database_path= \
    --uniprot_database_path= 
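
As a sanity check (my own diagnostic idea, not something from the AlphaFold repository), it may be worth confirming that unified memory is actually in effect in this environment before the full pipeline runs. With the two variables exported, a JAX allocation larger than a single GPU's VRAM should succeed by spilling into host RAM; if the sketch below also aborts, the flags are not taking effect. Note also that, as far as I understand, the model inference step runs on a single GPU, so requesting --gres=gpu:4 does not by itself add usable memory.

# unified_memory_test.py -- hypothetical diagnostic, not part of AlphaFold.
import jax.numpy as jnp

n = 110_000                      # 110000 x 110000 float32 is roughly 48 GB
x = jnp.ones((n, n), dtype=jnp.float32)
x.block_until_ready()            # force the allocation to actually happen
print("Allocated %.1f GB" % (x.nbytes / 1e9))
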
KayDiederichs commented 1 year ago

your "python3 run_alphafold.py ..." command lacks several continuation slashes, presumably because you removed the paths? And you should show the actual error messages.

DrJesseHansen commented 1 year ago

Yes, you are correct: the continuation slashes were removed along with the paths. They are in the original script. Thank you.

Here is the end of the log file:

I0217 01:25:02.775132 22633130694464 templates.py:267] Found an exact template match 1v9d_A.
I0217 01:25:02.792168 22633130694464 pipeline.py:234] Uniref90 MSA size: 37 sequences.
I0217 01:25:02.793349 22633130694464 pipeline.py:235] BFD MSA size: 24 sequences.
I0217 01:25:02.794069 22633130694464 pipeline.py:236] MGnify MSA size: 2 sequences.
I0217 01:25:02.794656 22633130694464 pipeline.py:237] Final (deduplicated) MSA size: 60 sequences.
I0217 01:25:02.795226 22633130694464 pipeline.py:239] Total number of templates (NB: this can include bad templates and is later filtered to top 4): 20.
I0217 01:25:02.796629 22633130694464 jackhmmer.py:133] Launching subprocess "/mnt/nfs/clustersw/Debian/bullseye/hmmer/3.3.2/bin/jackhmmer -o /dev/null -A /tmp/tmpf00et6_e/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /tmp/tmpto0xd9_f.fasta /nfs/scistore14/rcsb/alphafold.databases/uniprot/uniprot.fasta"
I0217 01:25:02.855708 22633130694464 utils.py:36] Started Jackhmmer (uniprot.fasta) query
I0217 02:01:25.690917 22633130694464 utils.py:40] Finished Jackhmmer (uniprot.fasta) query in 2182.834 seconds
I0217 02:01:26.236864 22633130694464 run_alphafold.py:190] Running model model_1_multimer_v2_pred_0 on 4a_trimer_4b_dimer_23k_trimer_39p_trimer
I0217 02:01:26.238596 22633130694464 model.py:165] Running predict with shape(feat) = {'aatype': (4431,), 'residue_index': (4431,), 'seq_length': (), 'msa': (893, 4431), 'num_alignments': (), 'template_aatype': (4, 4431), 'template_all_atom_mask': (4, 4431, 37), 'template_all_atom_positions': (4, 4431, 37, 3), 'asym_id': (4431,), 'sym_id': (4431,), 'entity_id': (4431,), 'deletion_matrix': (893, 4431), 'deletion_mean': (4431,), 'all_atom_mask': (4431, 37), 'all_atom_positions': (4431, 37, 3), 'assembly_num_chains': (), 'entity_mask': (4431,), 'num_templates': (), 'cluster_bias_mask': (893,), 'bert_mask': (893, 4431), 'seq_mask': (4431,), 'msa_mask': (893, 4431)}
/nfs/scistore07/clustersw/debian/bullseye/cuda11.2/alphafold/2.2.4c/alphafold-2.2.4/alphafold/model/geometry/struct_of_array.py:136: FutureWarning: jax.tree_flatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_flatten instead.
  flat_array_like, inner_treedef = jax.tree_flatten(array_like)
/nfs/scistore07/clustersw/debian/bullseye/cuda11.2/alphafold/2.2.4c/alphafold-2.2.4/alphafold/model/geometry/struct_of_array.py:209: FutureWarning: jax.tree_unflatten is deprecated, and will be removed in a future release. Use jax.tree_util.tree_unflatten instead.
  value_dict[array_field] = jax.tree_unflatten(
2023-02-17 02:02:28.871775: E external/org_tensorflow/tensorflow/compiler/xla/shape_util.cc:311] INVALID_ARGUMENT: invalid shape type=11, dims=[-1781845888]
2023-02-17 02:02:28.879099: F external/org_tensorflow/tensorflow/tsl/platform/statusor.cc:33] Attempting to fetch value instead of handling error INVALID_ARGUMENT: invalid shape type=11, dims=[-1781845888]
Fatal Python error: Aborted

Current thread 0x00001495afb71740 (most recent call first):
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 994 in backend_
compile
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/profiler.py", line 313 in wrapper
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 1054 in compile
_or_get_cached
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 1136 in from_xl
a_computation
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 978 in compile
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 342 in _xla_cal
lable_uncached
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/linear_util.py", line 309 in memoized_f
un
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/dispatch.py", line 234 in _xla_cal
l_impl
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/core.py", line 701 in process_call
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/core.py", line 1955 in call_bind
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/core.py", line 1939 in bind
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/api.py", line 606 in cache_miss
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/jax/_src/traceback_util.py", line 162 in re
raise_with_filtered_traceback
  File "/nfs/scistore07/clustersw/debian/bullseye/cuda11.2/alphafold/2.2.4c/alphafold-2.2.4/alphafold/model/model.py", line 167 in predict
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py", line 198 in predict_structure
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py", line 398 in main
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/absl/app.py", line 258 in _run_main
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/lib/python3.9/site-packages/absl/app.py", line 312 in run
  File "/mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py", line 422 in <module>
/var/lib/slurm/slurmd/job2543698/slurm_script: line 39: 3334595 Aborted                 python3 /mnt/nfs/clustersw/Debian/bullseye/cuda/11.2/alphafold/2.2.4c/alphafold-2.2.4/run_alphafold.py --model_preset=multimer --fasta_paths=$MY_PROTEIN_PATH --output_dir=$(dirname $MY_PROTEIN_PATH) --data_dir=/nfs/scistore14/rcsb/alphafold.databases/ --mgnify_database_path=/nfs/scistore14/rcsb/alphafold.databases/mgnify/mgy_clusters_2018_12.fa --template_mmcif_dir=/nfs/scistore14/rcsb/alphafold.databases.v2/pdb_mmcif/mmcif_files/ --max_template_date=2020-05-14 --obsolete_pdbs_path=/nfs/scistore14/rcsb/alphafold.databases.v2/pdb_mmcif/obsolete.dat --use_gpu_relax=true --bfd_database_path=/nfs/scistore14/rcsb/alphafold.databases/bfd/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --uniclust30_database_path=/nfs/scistore14/rcsb/alphafold.databases/uniclust30/uniclust30_2018_08/uniclust30_2018_08 --uniref90_database_path=/nfs/scistore14/rcsb/alphafold.databases/uniref90/uniref90.fasta --pdb_seqres_database_path=/nfs/scistore14/rcsb/alphafold.databases/pdb_seqres/pdb_seqres.txt --uniprot_database_path=/nfs/scistore14/rcsb/alphafold.databases/uniprot/uniprot.fasta
DrJesseHansen commented 1 year ago

Note that I have tried running this on various GPUs and none of them work, including A40s, which have 40 GB of VRAM.
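
One possible (unconfirmed) reading of the log above: this may not be an out-of-memory failure at all. The XLA error reports dims=[-1781845888], a negative dimension, which is what a tensor dimension larger than 2**31 - 1 looks like after a signed 32-bit overflow. For this 4431-residue multimer, a pair-style activation of shape 4431 x 4431 x 128 (128 being the assumed pair channel count) has 2,513,121,408 elements, and that number reinterpreted as a signed 32-bit integer is exactly -1,781,845,888. If that reading is right, more VRAM or unified memory will not help; only reducing the size of the complex, or splitting it, would. A small arithmetic sketch:

# overflow_check.py -- plain arithmetic illustrating the suspected int32 overflow.
n_res = 4431            # total residues in the multimer (from the log above)
pair_channels = 128     # assumed channel count of the pair representation

n_elements = n_res * n_res * pair_channels
print(n_elements)       # 2513121408, which exceeds 2**31 - 1 = 2147483647

# Reinterpret as a signed 32-bit integer (two's-complement wrap-around).
wrapped = (n_elements + 2**31) % 2**32 - 2**31
print(wrapped)          # -1781845888, matching the dims value in the XLA error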