nest / nest-simulator

The NEST simulator
http://www.nest-simulator.org
GNU General Public License v2.0

Using docker image on cluster gives error #3292

Open razvangamanut opened 3 months ago

razvangamanut commented 3 months ago

Describe the bug
I ran a Python script on the university HPC using the NEST 3.8 Docker image with OpenMPI and Singularity. When I submitted the job via Slurm, it produced errors.

To Reproduce Steps to reproduce the behavior:

  1. (Minimal) reproducing example

The main commands in the Slurm batch file were:

```
module load openmpi.gcc/4.0.3
module load singularity

srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
# or
mpirun -n 8 singularity run ./nest.sif python3 simulation.py
```
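For context, the full submission script followed roughly the sketch below (a hypothetical sketch: the job name, task count, and walltime are placeholders, and the module names are site-specific):

```shell
#!/bin/bash
#SBATCH --job-name=nest-sim     # placeholder job name
#SBATCH --ntasks=8              # number of MPI ranks
#SBATCH --time=01:00:00         # placeholder walltime

# Load the host-side MPI stack and Singularity (module names are site-specific)
module load openmpi.gcc/4.0.3
module load singularity

# Launch one containerized Python process per rank via the host's PMIx
srun --mpi=pmix singularity run ./nest.sif python3 simulation.py
```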

Screenshots

[when using `srun --mpi=pmix`]:

```
PMIX ERROR: ERROR in file ../../../../../../src/mca/gds/ds12/gds_ds12_lock_pthread.c at line 169
```

[when using `mpirun -n 8`]:

```
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  getting local rank failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS

It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during orte_init; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  orte_ess_init failed
  --> Returned value No permission (-17) instead of ORTE_SUCCESS

It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can fail
during MPI_INIT; some of which are due to configuration or environment
problems. This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_mpi_init: ompi_rte_init failed
  --> Returned "No permission" (-17) instead of "Success" (0)

An error occurred in MPI_Init_thread on a NULL communicator
MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
and potentially your MPI job)
Local abort before MPI_INIT completed completed successfully, but am not
able to aggregate error messages, and not able to guarantee that all
other processes were killed!

Primary job terminated normally, but 1 process returned a non-zero exit
code. Per user-direction, the job has been aborted.

mpirun detected that one or more processes exited with non-zero status,
thus causing the job to be terminated. The first process to do so was:

  Process name: [[28962,1],0]
  Exit code:    1
```


terhorstd commented 3 months ago

My apologies, closed accidentally.

gtrensch commented 3 months ago

@razvangamanut, the problem is most likely caused by an incompatibility between the MPI library inside the container and the one on the host. We are afraid that for your HPC system, NEST needs to be built from source, so that it links against the system-specific MPI libraries. Currently, we do not know how to properly handle external site-specific MPI setups, especially on HPC systems. You may also contact the administrators of your HPC system to ask whether they know of a solution. We would be very interested in such expertise as well.
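For reference, a from-source build against the host MPI typically follows the pattern below (a sketch assuming NEST's documented CMake option `-Dwith-mpi=ON`; the install prefix, parallelism, and module name are placeholders for your site):

```shell
# Build NEST against the MPI stack provided by the cluster
module load openmpi.gcc/4.0.3       # site-specific module name

git clone https://github.com/nest/nest-simulator.git
mkdir nest-build && cd nest-build
cmake -DCMAKE_INSTALL_PREFIX=$HOME/nest-install \
      -Dwith-mpi=ON \
      ../nest-simulator
make -j 4 && make install

# Set up PATH/PYTHONPATH for the new installation
source $HOME/nest-install/bin/nest_vars.sh
```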

hamannju commented 1 month ago

Hello, I had a similar issue with NEST/OpenMPI compatibility while helping a PhD student get some old code running. We worked on the Imperial College London cluster, where you can use conda environments from your user profile, so we built NEST 2.20.2 into a clean conda environment, and that worked well with MPI.

This is the repo with the code: https://github.com/hamannju/anaconda-nest

It was a little messy, but if the user can bring their own conda env, it runs on an HPC cluster.
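For anyone trying the same route, the conda-based setup can be sketched roughly as follows (a sketch assuming the `nest-simulator` package on the conda-forge channel; the environment name, Python version, and version pin are illustrative):

```shell
# Create an isolated environment with NEST from conda-forge
conda create -n nest -c conda-forge nest-simulator python=3.8
conda activate nest

# Quick sanity check that the module imports
python -c "import nest; print(nest.__version__)"
```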