haddocking / haddock3

Official repo of the modular BioExcel version of HADDOCK
https://www.bonvinlab.org/haddock3
Apache License 2.0

HPC mode crash #755

Closed: AnastasiiaDuchenko closed this issue 9 months ago

AnastasiiaDuchenko commented 10 months ago

HPC mode is not working. I execute a Slurm script with a configuration set up for HPC mode, and the run crashes (logs attached). As a test I changed the mode from HPC to Local, and the same config file then worked. What could be the problem with using HPC mode together with Slurm?

SLURM script

#!/bin/bash
#SBATCH --partition=Tucana
#SBATCH --job-name=config-test3
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=4
#SBATCH --mem=500gb
#SBATCH --time=48:00:00
#SBATCH --output=%x_%j.out
echo "======================================================"
echo "Start Time : $(date)"
echo "Submit Dir : $SLURM_SUBMIT_DIR"
echo "Job ID/Name : $SLURM_JOBID / $SLURM_JOB_NAME"
echo "Num Tasks : $SLURM_NTASKS total [$SLURM_NNODES nodes @ $SLURM_CPUS_ON_NODE CPUs/node]"
echo "Hostname : $HOSTNAME"
echo "======================================================"
echo ""

# Go to the working dir:

cd ${SLURM_SUBMIT_DIR}

# Load required modules:

module load anaconda3/2022.10
module list

# Get to the HADDOCK 3 dir:

HADDOCK_DIR="/users/aduchenk/software/haddock3/haddock3"
cd ${HADDOCK_DIR}
pwd

# Prepare HADDOCK env:

conda activate haddock3
wait

# Move to the examples dir and choose the example:

cd ${HADDOCK_DIR}/examples/docking-antibody-antigen
haddock3 docking-antibody-antigen-ranairCDR-clt-full-hpc.cfg

echo ""
echo "======================================================"
echo "End Time : $(date)"

[Non-working] Cfg file for HPC mode (docking-antibody-antigen-ranairCDR-clt-full-hpc.cfg)

# ====================================================================
# Protein-protein docking example with NMR-derived ambiguous interaction restraints

# directory in which the scoring will be done
run_dir = "run1-ranairCDR-cltsel-full"

# compute mode
mode = "hpc"
# batch system
batch_type = "slurm"
# queue name
queue = "short"

# in which queue the jobs should run, if nothing is defined
#  it will take the system's default
# queue = "short"
# concatenate models inside each job, concat = 5 each .job will produce 5 models
concat = 5
#  Limit the number of concurrent submissions to the queue
queue_limit = 4

molecules = [
    "data/4G6K_fv.pdb",
    "data/4I1B-matched.pdb"
    ]

Error in log

[2023-12-07 14:04:25,819 libutil ERROR] list index out of range
Traceback (most recent call last):
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libutil.py", line 335, in log_error_and_exit
    yield
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/clis/cli.py", line 185, in main
    workflow.run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libworkflow.py", line 43, in run
    step.execute()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libworkflow.py", line 152, in execute
    self.module.run()  # type: ignore
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/modules/base_cns_module.py", line 61, in run
    self._run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/modules/topology/topoaa/__init__.py", line 215, in _run
    engine.run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libhpc.py", line 181, in run
    worker.run()
  File "/users/aduchenk/software/haddock3/haddock3/src/haddock/libs/libhpc.py", line 103, in run
    self.job_id = int(p.stdout.decode("utf-8").split()[-1])
IndexError: list index out of range
[2023-12-07 14:04:25,821 libutil ERROR] list index out of range
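
The failing line in the traceback is the one that parses the Slurm job ID: judging by that line, haddock3 takes the last whitespace-separated token of the sbatch output and casts it to an integer. If sbatch prints nothing to stdout (for example because the submission fails from inside an already allocated job), split()[-1] raises exactly this IndexError. A minimal sketch of that parsing pattern, with a hypothetical helper name and a defensive check added (not the library's actual code):

import subprocess

def submit_and_parse_job_id(job_script):
    """Hypothetical helper mirroring the parsing pattern seen in the traceback."""
    # sbatch normally answers "Submitted batch job <id>" on stdout.
    p = subprocess.run(["sbatch", job_script], capture_output=True)
    tokens = p.stdout.decode("utf-8").split()
    if not tokens:
        # An empty stdout (e.g. a failed submission) is what turns
        # tokens[-1] into "IndexError: list index out of range".
        raise RuntimeError(f"sbatch gave no output; stderr: {p.stderr.decode('utf-8')!r}")
    return int(tokens[-1])

In other words, an empty reply from sbatch is enough to trigger the error seen in the log.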

For testing, I tried the same config file and just changed the mode from HPC to Local:

[Working] Cfg file for Local mode

# ====================================================================
# Protein-protein docking example with NMR-derived ambiguous interaction restraints

# directory in which the scoring will be done
run_dir = "run1-ranairCDR-cltsel-full"

# compute mode
mode = "local"
# batch system
batch_type = "slurm"
# queue name
queue = "short"

# in which queue the jobs should run, if nothing is defined
#  it will take the system's default
# queue = "short"
# concatenate models inside each job, concat = 5 each .job will produce 5 models
concat = 5
#  Limit the number of concurrent submissions to the queue
queue_limit = 250

[Working] Log file

[2023-12-07 13:29:51,776 __init__ INFO] [topoaa] Running CNS Jobs n=2
[2023-12-07 13:29:51,776 libutil INFO] Selected 2 cores to process 2 jobs, with 64 maximum available cores.
[2023-12-07 13:29:51,776 libparallel INFO] Using 2 cores
[2023-12-07 13:29:55,459 libparallel INFO] >> /4G6K_fv.inp completed 50% 
[2023-12-07 13:29:55,459 libparallel INFO] >> /4I1B-matched.inp completed 100% 
[2023-12-07 13:29:55,459 libparallel INFO] 2 tasks finished
[2023-12-07 13:29:55,459 __init__ INFO] [topoaa] CNS jobs have finished
[2023-12-07 13:29:55,473 base_cns_module INFO] Module [topoaa] finished.
[2023-12-07 13:29:55,473 __init__ INFO] [topoaa] took 4 seconds
[2023-12-07 13:29:56,500 base_cns_module INFO] Running [rigidbody] module
[2023-12-07 13:29:56,502 __init__ INFO] [rigidbody] crossdock=true
[2023-12-07 13:29:56,502 __init__ INFO] [rigidbody] Preparing jobs...
[2023-12-07 13:30:41,902 __init__ INFO] [rigidbody] Running CNS Jobs n=10000
[2023-12-07 13:30:41,903 libutil INFO] Selected 8 cores to process 10000 jobs, with 64 maximum available cores.
[2023-12-07 13:30:41,938 libparallel INFO] Using 8 cores
amjjbonvin commented 10 months ago

If you send the full haddock3 process to Slurm, you should run the workflow in local mode.

Check the following tutorial for an explanation of the running modes:

https://www.bonvinlab.org/education/HADDOCK3/HADDOCK3-antibody-antigen-bioexcel2023/#running-haddock3
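
For anyone hitting the same crash, a minimal sketch of the two setups, built from the parameters already shown in the configs above (the ncores line and the specific values are assumptions, not taken from this issue):

# Sketch only, not an official recipe.
# Whole run wrapped in a Slurm batch script (as in the script above):
# let haddock3 use the allocated cores directly.
mode = "local"
ncores = 4            # assumed value; match --ntasks-per-node in the Slurm script

# Alternative: start haddock3 outside a batch job (e.g. on a login node)
# and let it submit its own jobs to the queue.
# mode = "hpc"
# batch_type = "slurm"
# queue = "short"
# concat = 5
# queue_limit = 4

In short, mode = "local" runs the CNS jobs within whatever allocation the surrounding Slurm script provides, while mode = "hpc" makes haddock3 call sbatch itself, which is intended for launching haddock3 from outside another batch job.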