prehensilecode / alphafold_singularity

Singularity recipe for AlphaFold
GNU General Public License v3.0

stockholm msa write failed - system error: No space left on device #24

Closed prehensilecode closed 1 year ago

prehensilecode commented 1 year ago

Working on the update to AlphaFold 2.3.1. Running a test:

python3 ${ALPHAFOLD_DIR}/singularity/run_singularity.py \
    --use_gpu --gpu_devices=${SLURM_JOB_GPUS} \
    --data_dir=${ALPHAFOLD_DATADIR} \
    --fasta_paths=T1050.fasta \
    --max_template_date=2020-05-14 \
    --model_preset=monomer_casp14

Everything runs fine until Jackhmmer fails during the MGnify search with “stockholm msa write failed”:

SLURM_JOB_GPUS=0,1,2,3
ALPHAFOLD_DIR=/ifs/opt/alphafold/2.3.1
ALPHAFOLD_DATADIR=/beegfs/AlphaFoldDatabases-2-3-1
I0224 10:57:57.344248 23456247859008 run_singularity.py:121] Binding /ifs/sysadmin/Testing/AlphaFold -> /mnt/fasta_path_0
I0224 10:57:57.344326 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/uniref90 -> /mnt/uniref90_database_path
I0224 10:57:57.344369 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/mgnify -> /mnt/mgnify_database_path
I0224 10:57:57.344406 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1 -> /mnt/data_dir
I0224 10:57:57.344439 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/pdb_mmcif -> /mnt/template_mmcif_dir
I0224 10:57:57.344476 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/pdb_mmcif -> /mnt/obsolete_pdbs_path
I0224 10:57:57.344508 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/pdb70 -> /mnt/pdb70_database_path
I0224 10:57:57.344553 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/uniref30 -> /mnt/uniref30_database_path
I0224 10:57:57.344601 23456247859008 run_singularity.py:121] Binding /beegfs/AlphaFoldDatabases-2-3-1/bfd -> /mnt/bfd_database_path
/ifs/opt/alphafold/2.3.1/alphafold.sif
singularity run --nv --bind /ifs/sysadmin/Testing/AlphaFold:/mnt/fasta_path_0,/beegfs/AlphaFoldDatabases-2-3-1/uniref90:/mnt/uniref90_database_path,/beegfs/AlphaFoldDatabases-2-3-1/mgnify:/mnt/mgnify_database_path,/beegfs/AlphaFoldDatabases-2-3-1:/mnt/data_dir,/beegfs/AlphaFoldDatabases-2-3-1/pdb_mmcif:/mnt/template_mmcif_dir,/beegfs/AlphaFoldDatabases-2-3-1/pdb_mmcif:/mnt/obsolete_pdbs_path,/beegfs/AlphaFoldDatabases-2-3-1/pdb70:/mnt/pdb70_database_path,/beegfs/AlphaFoldDatabases-2-3-1/uniref30:/mnt/uniref30_database_path,/beegfs/AlphaFoldDatabases-2-3-1/bfd:/mnt/bfd_database_path,/local/scratch/8794444:/mnt/output --env OPENMM_CPU_THREADS=12 --env TF_FORCE_UNIFIED_MEMORY=1 --env XLA_PYTHON_CLIENT_MEM_FRACTION=4.0 /ifs/opt/alphafold/2.3.1/alphafold.sif --fasta_paths=/mnt/fasta_path_0/T1050.fasta --uniref90_database_path=/mnt/uniref90_database_path/uniref90.fasta --mgnify_database_path=/mnt/mgnify_database_path/mgy_clusters_2022_05.fa --data_dir=/mnt/data_dir --template_mmcif_dir=/mnt/template_mmcif_dir/mmcif_files --obsolete_pdbs_path=/mnt/obsolete_pdbs_path/obsolete.dat --pdb70_database_path=/mnt/pdb70_database_path/pdb70 --uniref30_database_path=/mnt/uniref30_database_path/UniRef30_2021_03 --bfd_database_path=/mnt/bfd_database_path/bfd_metaclust_clu_complete_id30_c90_final_seq.sorted_opt --output_dir=/mnt/output --max_template_date=2020-05-14 --db_preset=full_dbs --model_preset=monomer_casp14 --benchmark=False --use_precomputed_msas=False --num_multimer_predictions_per_model=5 --run_relax=True --use_gpu_relax=True --logtostderr
/sbin/ldconfig.real: Can't create temporary cache file /etc/ld.so.cache~: Read-only file system
I0224 10:58:04.423412 23456247981888 templates.py:857] Using precomputed obsolete pdbs /mnt/obsolete_pdbs_path/obsolete.dat.
I0224 10:58:04.603165 23456247981888 xla_bridge.py:353] Unable to initialize backend 'tpu_driver': NOT_FOUND: Unable to find driver in registry given worker:
I0224 10:58:05.062923 23456247981888 xla_bridge.py:353] Unable to initialize backend 'rocm': NOT_FOUND: Could not find registered platform with name: "rocm". Available platform names are: Host CUDA Interpreter
I0224 10:58:05.063276 23456247981888 xla_bridge.py:353] Unable to initialize backend 'tpu': module 'jaxlib.xla_extension' has no attribute 'get_tpu_client'
I0224 10:58:05.063360 23456247981888 xla_bridge.py:353] Unable to initialize backend 'plugin': xla_extension has no attributes named get_plugin_device_client. Compile TensorFlow with //tensorflow/compiler/xla/python:enable_plugin_device set to true (defaults to false) to enable this.
I0224 10:58:08.165593 23456247981888 run_alphafold.py:386] Have 5 models: ['model_1_pred_0', 'model_2_pred_0', 'model_3_pred_0', 'model_4_pred_0', 'model_5_pred_0']
I0224 10:58:08.165745 23456247981888 run_alphafold.py:403] Using random seed 1666343438848163300 for the data pipeline
I0224 10:58:08.165915 23456247981888 run_alphafold.py:161] Predicting T1050
I0224 10:58:08.180227 23456247981888 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmp2424h2pr/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/T1050.fasta /mnt/uniref90_database_path/uniref90.fasta"
I0224 10:58:08.212937 23456247981888 utils.py:36] Started Jackhmmer (uniref90.fasta) query
I0224 11:05:24.744865 23456247981888 utils.py:40] Finished Jackhmmer (uniref90.fasta) query in 436.532 seconds
I0224 11:05:28.540683 23456247981888 jackhmmer.py:133] Launching subprocess "/usr/bin/jackhmmer -o /dev/null -A /tmp/tmpogq3j1ny/output.sto --noali --F1 0.0005 --F2 5e-05 --F3 5e-07 --incE 0.0001 -E 0.0001 --cpu 8 -N 1 /mnt/fasta_path_0/T1050.fasta /mnt/mgnify_database_path/mgy_clusters_2022_05.fa"
I0224 11:05:28.570472 23456247981888 utils.py:36] Started Jackhmmer (mgy_clusters_2022_05.fa) query
I0224 11:19:06.830706 23456247981888 utils.py:40] Finished Jackhmmer (mgy_clusters_2022_05.fa) query in 818.260 seconds
Traceback (most recent call last):
  File "/app/alphafold/run_alphafold.py", line 432, in <module>
    app.run(main)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/opt/conda/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "/app/alphafold/run_alphafold.py", line 408, in main
    predict_structure(
  File "/app/alphafold/run_alphafold.py", line 172, in predict_structure
    feature_dict = data_pipeline.process(
  File "/app/alphafold/alphafold/data/pipeline.py", line 171, in process
    jackhmmer_mgnify_result = run_msa_tool(
  File "/app/alphafold/alphafold/data/pipeline.py", line 94, in run_msa_tool
    result = msa_runner.query(input_fasta_path, max_sto_sequences)[0]  # pytype: disable=wrong-arg-count
  File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 170, in query
    return self.query_multiple([input_fasta_path], max_sequences)[0]
  File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 181, in query_multiple
    single_chunk_results.append([self._query_chunk(
  File "/app/alphafold/alphafold/data/tools/jackhmmer.py", line 142, in _query_chunk
    raise RuntimeError(
RuntimeError: Jackhmmer failed
stderr:
Fatal exception (source file esl_msafile_stockholm.c, line 1263):
stockholm msa write failed
system error: No space left on device

INFO: AlphaFold returned 0
prehensilecode commented 1 year ago

This is probably the same issue as reported for AlphaFold itself: https://github.com/deepmind/alphafold/issues/280
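One way to confirm that the temporary directory is what ran out of space is to check the free space on it from inside the container. The log shows Jackhmmer writing its alignment to /tmp/tmp…/output.sto, so that is the filesystem to look at. The snippet below is only a minimal sketch; `free_gib` is a hypothetical helper, not part of AlphaFold or this repository.

```python
import shutil
import tempfile


def free_gib(path: str) -> float:
    """Return the free space, in GiB, on the filesystem that holds `path`."""
    return shutil.disk_usage(path).free / 2**30


if __name__ == "__main__":
    # tempfile.gettempdir() honours TMPDIR/TEMP/TMP and otherwise falls back
    # to a default such as /tmp.
    tmp = tempfile.gettempdir()
    print(f"{tmp}: {free_gib(tmp):.1f} GiB free")
```

If the number printed is small compared with the size of the Stockholm alignment Jackhmmer produces, the "No space left on device" error is expected.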

prehensilecode commented 1 year ago

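Workaround: Jackhmmer writes its intermediate alignments to /tmp inside the container (see the /tmp/tmp…/output.sto paths in the log), and that is the filesystem that fills up. Modify the run_singularity.py script to bind the job's TMP (or TMPDIR) directory to /tmp in the container.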
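The workaround above could look roughly like the following. This is a sketch under assumptions, not the code from the actual fix: `tmp_bind_spec` is a hypothetical helper, the example bind entry is a placeholder, and the only thing taken from the log is the `host_dir:container_dir` form that `singularity run --bind` accepts.

```python
import os
from typing import Mapping, Optional


def tmp_bind_spec(env: Mapping[str, str] = os.environ,
                  container_tmp: str = "/tmp") -> Optional[str]:
    """Return a 'host_dir:/tmp' bind spec for the job's scratch directory.

    Checks TMP first, then TMPDIR, and returns None if neither names an
    existing directory (hypothetical helper, for illustration only).
    """
    for var in ("TMP", "TMPDIR"):
        host_dir = env.get(var)
        if host_dir and os.path.isdir(host_dir):
            return f"{host_dir}:{container_tmp}"
    return None


if __name__ == "__main__":
    # Placeholder for the bind list the launcher script already builds.
    binds = ["/path/to/databases:/mnt/data_dir"]
    spec = tmp_bind_spec()
    if spec is not None:
        binds.append(spec)
    print("--bind", ",".join(binds))
```

With a scheduler that sets TMPDIR to per-job local scratch, this puts the container's /tmp on that scratch space instead of on the default, smaller filesystem.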

prehensilecode commented 1 year ago

Fixed by commit 1c7efe8487dcb8cda9f793cb3bdcec4ce5ab21e0