Open nuriallv opened 5 years ago
You have a couple of options. The problem is that the two mpirun instances need a rendezvous server to find each other. One way to provide it is to start ompi-server before executing the mpirun commands, and then point each mpirun at that process for the rendezvous.
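A rough sketch of the ompi-server route, assuming your build installs ompi-server and that the `--report-uri` and `--ompi-server` options behave as documented in the Open MPI 4.x man pages (check `ompi-server --help` and `mpirun --help` on your install). The binaries `./accept_side` and `./connect_side` and the URI file path are placeholders, not names from the reproducer. The script only prints what it would do unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only. ./accept_side and ./connect_side are hypothetical binaries,
# and URI_FILE is an arbitrary path picked for this example.
URI_FILE="${URI_FILE:-/tmp/ompi-server.uri}"
SERVER_URI_ARG="file:$URI_FILE"

if [ "${DRY_RUN:-1}" = "0" ]; then
  # Remove any stale URI file left over from an earlier server.
  rm -f "$URI_FILE"

  # Start the rendezvous server; it writes its contact URI to URI_FILE.
  ompi-server --report-uri "$URI_FILE" &
  SERVER_PID=$!

  # Wait until the URI file is non-empty before launching anything.
  while [ ! -s "$URI_FILE" ]; do sleep 1; done

  # Point both mpirun instances at the same server.
  mpirun --ompi-server "$SERVER_URI_ARG" -np 1 ./accept_side &
  ACCEPT_PID=$!
  mpirun --ompi-server "$SERVER_URI_ARG" -np 1 ./connect_side

  # Wait for the accepting job, then shut the server down.
  wait "$ACCEPT_PID"
  kill "$SERVER_PID"
else
  echo "dry run: both mpirun commands would get --ompi-server $SERVER_URI_ARG"
fi
```

How the port name gets from the accepting job to the connecting job is still up to the application; the server only provides the rendezvous between the two mpirun instances.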
The other option is to use PRRTE (https://github.com/pmix/prrte) as a "shim" environment. You would get a Slurm allocation, then start PRRTE and run your apps using PRRTE's "prun" command (which is identical to mpirun). PRRTE knows how to provide the rendezvous service. If you want to use PRRTE, the build/use instructions are available at the bottom of https://pmix.org/support/how-to/
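A rough sketch of the PRRTE route, meant to run inside an existing Slurm allocation. It assumes the `prte`, `prun`, and `pterm` commands from a PRRTE install are on your PATH and that `prte` accepts `--daemonize`; option names can differ between PRRTE versions, so treat the how-to page above as the authority. The binaries `./accept_side` and `./connect_side` are placeholders. As written it only prints a summary unless you set `DRY_RUN=0`.

```shell
#!/bin/sh
# Sketch only; run inside an existing Slurm allocation.
# ./accept_side and ./connect_side are hypothetical binaries.
NPROCS="${NPROCS:-1}"

if [ "${DRY_RUN:-1}" = "0" ]; then
  # Start the PRRTE runtime in the background on the allocated nodes.
  prte --daemonize

  # Launch both jobs through the same runtime so they can rendezvous.
  prun -n "$NPROCS" ./accept_side &
  ACCEPT_PID=$!
  prun -n "$NPROCS" ./connect_side

  # Wait for the accepting job, then tear the runtime down.
  wait "$ACCEPT_PID"
  pterm
else
  echo "dry run: would start prte, run two prun jobs with -n $NPROCS, then pterm"
fi
```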
Details of the problem
I'm getting the following error when using MPI_Comm_connect/MPI_Comm_accept on a single node between processes launched by different mpirun commands.
If I instead spawn a new process within the same mpirun, there is no error. I'm attaching a reproducer: reproducer.zip
Environment
Open MPI master, commit 69bd9453e9614cdfd7370f9fb4cfa8dfec72f27d, configured with: ./configure --prefix=${HOME}/ompi/Build --enable-orterun-prefix-by-default --with-platform=optimized
Linux cluster, resource manager: Slurm. Nodes: Dell T7500 chassis, 2x Gainestown E5520 @ 2.27 GHz (8 cores, 16 hyperthreads), ~12 GB RAM (actual amount varies between 8 GB and 16 GB), InfiniBand DDR 20 Gb/s (MT25208 cards), Ethernet.