haddocking / haddock3

Official repo of the modular BioExcel version of HADDOCK
https://www.bonvinlab.org/haddock3
Apache License 2.0

HPC mode crash #755

Closed: AnastasiiaDuchenko closed this issue 9 months ago

AnastasiiaDuchenko commented 10 months ago

HPC mode is not working. I execute a Slurm script with a configuration set up for HPC mode, and the run crashes (logs attached). As a test I changed the mode from HPC to Local, and the same config file then worked. What could be the problem with using HPC mode together with Slurm?

SLURM script

#!/bin/bash
#SBATCH --partition=Tucana
#SBATCH --job-name=config-test3
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=500gb
#SBATCH --time=48:00:00
#SBATCH --output=%x_%j.out
echo "======================================================"
echo "Start Time : $(date)"
echo "Submit Dir : $SLURM_SUBMIT_DIR"
echo "Job ID/Name : $SLURM_JOBID / $SLURM_JOB_NAME"
echo "Num Tasks : $SLURM_NTASKS total [$SLURM_NNODES nodes @ $SLURM_CPUS_ON_NODE CPUs/node]"
echo "Hostname : $HOSTNAME"
echo "======================================================"
echo ""

# Go to the working dir:

cd ${SLURM_SUBMIT_DIR}

# Load required modules:

module load anaconda3/2022.10
module list

# Get to the HADDOCK 3 dir:

HADDOCK_DIR="/users/aduchenk/software/haddock3/haddock3"
cd ${HADDOCK_DIR}
pwd

# Prepare HADDOCK env:

conda activate haddock3
wait

# Move to the examples dir and choose the example:

cd ${HADDOCK_DIR}/examples/docking-antibody-antigen
haddock3 docking-antibody-antigen-ranairCDR-clt-full-hpc.cfg

echo ""
echo "======================================================"
echo "End Time : $(date)"

[Non-working] Cfg file for HPC mode (docking-antibody-antigen-ranairCDR-clt-full-hpc.cfg)

# ====================================================================
# Protein-protein docking example with NMR-derived ambiguous interaction restraints

# directory in which the scoring will be done
run_dir = "run1-ranairCDR-cltsel-full"

# compute mode
mode = "hpc"
# batch system
batch_type = "slurm"
# queue name
queue = "short"

# in which queue the jobs should run, if nothing is defined
#  it will take the system's default
# queue = "short"
# concatenate models inside each job, concat = 5 each .job will produce 5 models
concat = 5
#  Limit the number of concurrent submissions to the queue
queue_limit = 4

molecules = [
    "data/4G6K_fv.pdb",
    "data/4I1B-matched.pdb"
    ]

Error in log

[2023-12-07 14:04:25,819 libutil ERROR] list index out of range
Traceback (most recent call last):
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libutil.py", line 335, in log_error_and_exit
    yield
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/clis/cli.py", line 185, in main
    workflow.run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libworkflow.py", line 43, in run
    step.execute()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libworkflow.py", line 152, in execute
    self.module.run()  # type: ignore
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/modules/base_cns_module.py", line 61, in run
    self._run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/modules/topology/topoaa/__init__.py", line 215, in _run
    engine.run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libhpc.py", line 181, in run
    worker.run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libhpc.py", line 103, in run
    self.job_id = int(p.stdout.decode("utf-8").split()[-1])
IndexError: list index out of range
[2023-12-07 14:04:25,821 libutil ERROR] list index out of range
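
The failing line in the traceback is the one that parses the Slurm job ID: judging by that line, haddock3 takes the last whitespace-separated token of the sbatch output and casts it to an integer. If sbatch prints nothing to stdout (for example because the submission fails from inside an already allocated job), split()[-1] raises exactly this IndexError. A minimal sketch of that parsing pattern, with a hypothetical helper name and a defensive check added (not the library's actual code):

import subprocess

def submit_and_parse_job_id(job_script):
    """Hypothetical helper mirroring the parsing pattern seen in the traceback."""
    # sbatch normally answers "Submitted batch job <id>" on stdout.
    p = subprocess.run(["sbatch", job_script], capture_output=True)
    tokens = p.stdout.decode("utf-8").split()
    if not tokens:
        # An empty stdout (e.g. a failed submission) is what turns
        # tokens[-1] into "IndexError: list index out of range".
        raise RuntimeError(f"sbatch gave no output; stderr: {p.stderr.decode('utf-8')!r}")
    return int(tokens[-1])

In other words, an empty reply from sbatch is enough to trigger the error seen in the log.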

For testing, I tried the same config file and just changed the mode from HPC to Local:

[Working] Cfg file for Local mode

# ====================================================================
# Protein-protein docking example with NMR-derived ambiguous interaction restraints

# directory in which the scoring will be done
run_dir = "run1-ranairCDR-cltsel-full"

# compute mode
mode = "local"
# batch system
batch_type = "slurm"
# queue name
queue = "short"

# in which queue the jobs should run, if nothing is defined
#  it will take the system's default
# queue = "short"
# concatenate models inside each job, concat = 5 each .job will produce 5 models
concat = 5
#  Limit the number of concurrent submissions to the queue
queue_limit = 250

[Working] Log file

[2023-12-07 13:29:51,776 __init__ INFO] [topoaa] Running CNS Jobs n=2
[2023-12-07 13:29:51,776 libutil INFO] Selected 2 cores to process 2 jobs, with 64 maximum available cores.
[2023-12-07 13:29:51,776 libparallel INFO] Using 2 cores
[2023-12-07 13:29:55,459 libparallel INFO] >> /4G6K_fv.inp completed 50% 
[2023-12-07 13:29:55,459 libparallel INFO] >> /4I1B-matched.inp completed 100% 
[2023-12-07 13:29:55,459 libparallel INFO] 2 tasks finished
[2023-12-07 13:29:55,459 __init__ INFO] [topoaa] CNS jobs have finished
[2023-12-07 13:29:55,473 base_cns_module INFO] Module [topoaa] finished.
[2023-12-07 13:29:55,473 __init__ INFO] [topoaa] took 4 seconds
[2023-12-07 13:29:56,500 base_cns_module INFO] Running [rigidbody] module
[2023-12-07 13:29:56,502 __init__ INFO] [rigidbody] crossdock=true
[2023-12-07 13:29:56,502 __init__ INFO] [rigidbody] Preparing jobs...
[2023-12-07 13:30:41,902 __init__ INFO] [rigidbody] Running CNS Jobs n=10000
[2023-12-07 13:30:41,903 libutil INFO] Selected 8 cores to process 10000 jobs, with 64 maximum available cores.
[2023-12-07 13:30:41,938 libparallel INFO] Using 8 cores
amjjbonvin commented 10 months ago

If you send the full haddock3 process to Slurm, you should run the workflow in local mode.

Check the following tutorial for an explanation of the running modes:

https://www.bonvinlab.org/education/HADDOCK3/HADDOCK3-antibody-antigen-bioexcel2023/#running-haddock3
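
For anyone hitting the same crash, a minimal sketch of the two setups, built from the parameters already shown in the configs above (the ncores line and the specific values are assumptions, not taken from this issue):

# Sketch only, not an official recipe.
# Whole run wrapped in a Slurm batch script (as in the script above):
# let haddock3 use the allocated cores directly.
mode = "local"
ncores = 4            # assumed value; match --ntasks-per-node in the Slurm script

# Alternative: start haddock3 outside a batch job (e.g. on a login node)
# and let it submit its own jobs to the queue.
# mode = "hpc"
# batch_type = "slurm"
# queue = "short"
# concat = 5
# queue_limit = 4

In short, mode = "local" runs the CNS jobs within whatever allocation the surrounding Slurm script provides, while mode = "hpc" makes haddock3 call sbatch itself, which is intended for launching haddock3 from outside another batch job.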