steineggerlab / foldseek

Foldseek enables fast and sensitive comparisons of large structure sets.
https://foldseek.com
GNU General Public License v3.0

foldseek-mpi Error: Structure alignment step died #328

Open vmkhot opened 1 month ago

vmkhot commented 1 month ago

Expected Behavior

A database-to-database "foldseek search" alignment using foldseek-mpi should run to completion.

Current Behavior

The structure alignment step dies right after the jobs for the structural alignment have been set up.

Could not delete tmp_v2/6428288360645111440/strualn.0!
--------------------------------------------------------------------------
Primary job  terminated normally, but 1 process returned
a non-zero exit code. Per user-direction, the job has been aborted.
--------------------------------------------------------------------------
Could not delete tmp_v2/6428288360645111440/strualn.0!
Could not delete tmp_v2/6428288360645111440/strualn.0!
--------------------------------------------------------------------------
mpirun detected that one or more processes exited with non-zero status, thus causing
the job to be terminated. The first process to do so was:

  Process name: [[9314,1],18]
  Exit code:    1
--------------------------------------------------------------------------
Error: Structure alignment step died

What I ran

#!/bin/bash
#SBATCH --job-name=foldseek
#SBATCH --partition=long
#SBATCH --time=168:00:00
#SBATCH --ntasks=30
#SBATCH --mem=200G
#SBATCH --output=foldseek_mpi_%j_2.log

module load mpi/openmpi/4.1.1

# conda environment
source ~/miniconda3/etc/profile.d/conda.sh
conda activate foldseek-mpi  # Activate conda environment 

# MPI run command
~/Programs/foldseek/build-mpi/src/foldseek search ../phold/phold_CR_MCP_foldseek_db/phold_foldseek_db ../phold/phold_CR_MCP_foldseek_db/phold_foldseek_db CR_MCP_foldseek_alignment_db tmp_v2 -a -e 1.000E-03 -v 3 --threads 30 --mpi-runner "mpirun -np 30 -mca ras_base_verbose 10 --display-allocation"
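
As a reference point (a sketch, not part of the original report): before committing to the full search, the same runner string can be tested on its own to confirm that 30 ranks actually launch across the allocation.

# Sketch: run under the same SLURM allocation as the search above.
# If this prints fewer than 30 hostnames, the MPI launch itself is at
# fault rather than foldseek.
mpirun -np 30 -mca ras_base_verbose 10 --display-allocation hostname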

Foldseek log

foldseek_issue_log.txt

Context

Your Environment

MMseqs Version: 16dc9150581778c2c65a153ed2e6e418d29fafe3-MPI

Foldseek was self-compiled with the MPI flag (-DHAVE_MPI=1):

conda create -n foldseek-mpi
conda activate foldseek-mpi
conda install gcc
conda install cmake
conda install conda-forge::openmpi-mpicc
conda install conda-forge::openmpi-mpicxx
git clone https://github.com/steineggerlab/foldseek.git
cd foldseek/ && mkdir build-mpi && cd build-mpi
cmake -DCMAKE_BUILD_TYPE=RELEASE -DCMAKE_INSTALL_PREFIX=. -DHAVE_MPI=1 ..
make
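
(Not part of the original report: one quick way to confirm the binary was really built against MPI is to check its shared-library dependencies, assuming it lands at build-mpi/src/foldseek as in the search command above.)

# Sketch: from inside build-mpi/, the MPI build should link the OpenMPI runtime
ldd src/foldseek | grep -i libmpi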

Thoughts

Your help is most appreciated!

Thanks, Varada

milot-mirdita commented 1 month ago

The tmp directory has to be shared between all MPI compute nodes through some mechanism (e.g., NFS).

This looks like the other nodes cannot access the tmp_v2 directory created on another node.
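
A quick way to verify this (a sketch, not from the original thread, assuming the same SLURM allocation): create a probe file in tmp_v2 and list it from one task per allocated node. If any node fails to see it, tmp_v2 is not on a shared filesystem.

# Sketch: check that every allocated node sees the same tmp_v2
mkdir -p tmp_v2 && touch tmp_v2/shared_fs_probe
srun --ntasks-per-node=1 ls -l tmp_v2/shared_fs_probe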

The larger issue is that we don't really test our MPI code anymore since we moved away from many low-CPU-core machines to few high-CPU-core machines. So I can't promise that the MPI implementation hasn't bitrotted away.