open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Error using MPI_Comm_connect/MPI_Comm_accept #6916

Open nuriallv opened 5 years ago

nuriallv commented 5 years ago

Details of the problem

I'm getting the following error when using MPI_Comm_connect/MPI_Comm_accept on a single node between processes spawned on different mpirun commands.

[d14.descartes:04196] [[23420,0],0] ORTE_ERROR_LOG: Not supported in file orted/pmix/pmix_server_dyn.c at line 702
[d14.descartes:04202] [[23420,1],0] ORTE_ERROR_LOG: Not supported in file dpm/dpm.c at line 403
--------------------------------------------------------------------------
Your application has invoked an MPI function that is not supported in
this environment.

  MPI function: MPI_Comm_accept
  Reason:       Underlying runtime environment does not support accept/connect functionality
--------------------------------------------------------------------------
[d14:04202] *** An error occurred in MPI_Comm_accept
[d14:04202] *** reported by process [1534853121,0]
[d14:04202] *** on communicator MPI_COMM_SELF
[d14:04202] *** MPI_ERR_INTERN: internal error
[d14:04202] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d14:04202] ***    and potentially your MPI job)
--------------------------------------------------------------------------
At least one pair of MPI processes are unable to reach each other for
MPI communications.  This means that no Open MPI device has indicated
that it can be used to communicate between these processes.  This is
an error; Open MPI requires that all MPI processes be able to reach
each other.  This error can sometimes be the result of forgetting to
specify the "self" BTL.

  Process 1 ([[23414,1],0]) is on host: d14
  Process 2 ([[23420,1],0]) is on host: unknown!
  BTLs attempted: self tcp

Your MPI job is now going to abort; sorry.
--------------------------------------------------------------------------
[d14.descartes:04211] [[23414,1],0] ORTE_ERROR_LOG: Unreachable in file dpm/dpm.c at line 495
[d14:04211] *** An error occurred in MPI_Comm_connect
[d14:04211] *** reported by process [1534459905,0]
[d14:04211] *** on communicator MPI_COMM_SELF
[d14:04211] *** MPI_ERR_INTERN: internal error
[d14:04211] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[d14:04211] ***    and potentially your MPI job)

If I instead spawn a new process from within the same mpirun, there is no error. A reproducer is attached (reproducer.zip).

Environment

Open MPI master, commit 69bd9453e9614cdfd7370f9fb4cfa8dfec72f27d, configured with:

./configure --prefix=${HOME}/ompi/Build --enable-orterun-prefix-by-default --with-platform=optimized

Linux cluster with Slurm as the resource manager. Nodes: Dell T7500 chassis, 2x Gainestown E5520 @ 2.27 GHz (8 cores, 16 hardware threads), ~12 GB RAM (actual amount varies between 8 GB and 16 GB), InfiniBand DDR 20 Gb/s (MT25208 cards), Ethernet.

rhc54 commented 5 years ago

You have a couple of options. The problem is that the two mpirun instances need a rendezvous server. One option is to start ompi-server before executing the mpirun commands and then point each mpirun at that process for the rendezvous.
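A minimal sketch of the ompi-server approach, under some assumptions: `./server` and `./client` are hypothetical names for the accept-side and connect-side binaries from the reproducer, and the URI file path is arbitrary. The option spellings follow the ompi-server and mpirun man pages as commonly used, so check them against the version you have installed.

```shell
# Assumed, illustrative path for the file that will hold the server's contact URI.
URI_FILE="${HOME}/ompi-server.uri"

# 1. Start the rendezvous server and have it write its contact URI to the file.
ompi-server --report-uri "${URI_FILE}"

# 2. Start the accepting side, pointing its mpirun at the rendezvous server.
#    ./server is a hypothetical binary that calls MPI_Comm_accept.
mpirun -np 1 --ompi-server "file:${URI_FILE}" ./server &

# 3. Start the connecting side from a second mpirun, using the same server.
#    ./client is a hypothetical binary that calls MPI_Comm_connect.
mpirun -np 1 --ompi-server "file:${URI_FILE}" ./client

# Wait for the backgrounded accept side to finish.
wait
```

Both mpirun commands must be given the same URI so that they agree on where to rendezvous. The ompi-server process keeps running after the jobs finish and has to be stopped separately.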

The other option is to use PRRTE (https://github.com/pmix/prrte) as a "shim" environment. You would get a Slurm allocation, then start PRRTE and run your apps using PRRTE's "prun" command (which is identical to mpirun). PRRTE knows how to provide the rendezvous service. If you want to use PRRTE, then the build/use instructions are available at the bottom of https://pmix.org/support/how-to/