open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Init failed with ompi, pmix, Slurm #6095

Closed: asanchez1987 closed this issue 3 years ago

asanchez1987 commented 5 years ago

I'm trying to test Slurm + PMIx + Open MPI with the following software versions and configuration requests:

PMIx 3.1 with HEAD at 6f384bf19ae5a99f885871c2232583efbbbaf1ab

$ ../../source/configure --prefix=/home/alex/repos/pmix/install/3.1

Slurm 18.08 with HEAD at 3f5c0e58187af8903da0aeb967a45460b0ea4328 (future 18.08.4)

$ ../../slurm/configure --prefix=/home/alex/slurm/18.08/polaris --enable-multiple-slurmd --with-pmix=/home/alex/repos/pmix/install/3.1 --enable-developer
alex@polaris:~/t$ scontrol show conf | grep TmpFS
TmpFS                   = /home/alex/slurm/18.08/polaris/spool/slurmd-tmpfs-%n
alex@polaris:~/t$ ls -l /home/alex/slurm/18.08/polaris/spool/slurmd-tmpfs-compute*
/home/alex/slurm/18.08/polaris/spool/slurmd-tmpfs-compute1:
total 0

/home/alex/slurm/18.08/polaris/spool/slurmd-tmpfs-compute2:
total 0
alex@polaris:~/t$
contribs/pmi2 is also installed
alex@polaris:~/t$ srun --mpi=list
srun: MPI types are...
srun: pmix
srun: pmix_v3
srun: none
srun: openmpi
srun: pmi2
alex@polaris:~/t$

Slurm + PMIx (without OpenMPI) seem to work:

alex@polaris:~/t$ srun --mpi=pmix_v3 -n2 -N2 ~/repos/pmix/build/3.1/test/pmix_client -n 2 --job-fence -c
OK
OK
alex@polaris:~/t$
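
For context, the --job-fence test exercises roughly the following PMIx client sequence; this is only a minimal sketch of what such a client does, not the actual pmix_client source:

#include <pmix.h>
#include <stdio.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    /* Connect to the local PMIx server (here, Slurm's pmix_v3 plugin). */
    if (PMIX_SUCCESS != (rc = PMIx_Init(&myproc, NULL, 0))) {
        fprintf(stderr, "PMIx_Init failed: %d\n", rc);
        return 1;
    }

    /* Fence across the job; NULL/0 means all processes in my namespace. */
    if (PMIX_SUCCESS != (rc = PMIx_Fence(NULL, 0, NULL, 0))) {
        fprintf(stderr, "PMIx_Fence failed: %d\n", rc);
    } else {
        printf("OK\n");
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}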

OpenMPI v4.0.x commit 725f62554e10683dc2620d45b62d642380992516 (HEAD -> v4.0.x, tag: v4.0.0, origin/v4.0.x)

$ ../../source/configure --prefix=/home/alex/repos/ompi/install/4.0 --with-pmix=/home/alex/repos/pmix/install/3.1 --with-pmi=/home/alex/slurm/18.08/polaris --with-libevent=/usr --with-hwloc=/usr

All software pieces have been obtained with git clone.

Debian GNU/Linux testing (buster) 4.18.0-2-amd64

Everything runs on one laptop with Slurm nodes emulated through --enable-multiple-slurmd, so I guess pmix/ompi are working on top of shared memory.

As I said, Slurm + PMIx work well. The problem comes when OpenMPI comes into play:

alex@polaris:~/t$ srun --mpi=pmix_v3 -N2 -n4 mpi/mpi_hello
[polaris:11243] PMIX ERROR: SUCCESS in file ../../../source/src/event/pmix_event_registration.c at line 98
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--------------------------------------------------------------------------
[polaris:11243] *** An error occurred in MPI_Init
[polaris:11243] *** reported by process [2128830693,3]
[polaris:11243] *** on a NULL communicator
[polaris:11243] *** Unknown error
[polaris:11243] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[polaris:11243] ***    and potentially your MPI job)
[polaris:11243] UNEXPECTED MESSAGE tag = 105
srun: Job step aborted: Waiting up to 32 seconds for job step to finish.
[polaris:11242] PMIX ERROR: SUCCESS in file ../../../source/src/event/pmix_event_registration.c at line 98
[polaris:11242] UNEXPECTED MESSAGE tag = 105
--------------------------------------------------------------------------
It looks like MPI_INIT failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during MPI_INIT; some of which are due to configuration or environment
problems.  This failure appears to be an internal failure; here's some
additional information (which may only be relevant to an Open MPI
developer):

  ompi_interlib_declare
  --> Returned "Would block" (-10) instead of "Success" (0)
--------------------------------------------------------------------------
[polaris:11242] *** An error occurred in MPI_Init
[polaris:11242] *** reported by process [2128830693,2]
[polaris:11242] *** on a NULL communicator
[polaris:11242] *** Unknown error
[polaris:11242] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[polaris:11242] ***    and potentially your MPI job)
slurmstepd-compute1: error: *** STEP 20023.0 ON compute1 CANCELLED AT 2018-11-19T13:54:41 ***
srun: error: compute2: task 2: Exited with exit code 1
srun: error: compute1: tasks 0-1: Killed
srun: error: compute2: task 3: Killed
alex@polaris:~/t$ srun --mpi=pmi2 -N2 -n4 mpi/mpi_hello
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.4e380000.1
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
--------------------------------------------------------------------------
A system call failed during shared memory initialization that should
not have.  It is likely that your MPI job will now either abort or
experience performance degradation.

  Local host:  polaris
  System call: unlink(2) /dev/shm/vader_segment.polaris.4e380000.0
  Error:       No such file or directory (errno 2)
--------------------------------------------------------------------------
Hello world from process 2 of 4
Hello world from process 3 of 4
Hello world from process 0 of 4
Hello world from process 1 of 4
alex@polaris:~/t$
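
For reference, mpi/mpi_hello is just a hello-world style program; its source isn't included here, but a minimal equivalent that prints the output above looks roughly like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size;

    MPI_Init(&argc, &argv);                 /* this is where the failure above occurs */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    printf("Hello world from process %d of %d\n", rank, size);
    MPI_Finalize();
    return 0;
}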


Please, let me know if you need more information.

Thanks so much. Alex

hppritcha commented 5 years ago

If it's not too problematic, could you try testing against PMIx 2.1.4? My first guess from the output is that the Slurm pmix_v3 plugin doesn't support something Open MPI tries to do when built against PMIx 3.

rhc54 commented 5 years ago

It probably won't work that way either, but will work with mpirun. The issue is that the Slurm plugin does not include support for PMIx events, and the code in ompi_interlib_declare doesn't properly account for it. It likely wasn't caught pre-release since (a) OMPI does support events when run under mpirun, and (b) we don't consistently test the srun method.
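
To illustrate the pattern being described (purely a sketch with made-up names, not the actual Open MPI code): when event-handler registration reports that the host has no event support, MPI_Init should note that and continue rather than abort, e.g.:

#include <stdio.h>

/* Illustrative status codes for the sketch only. */
enum { RC_SUCCESS = 0, RC_WOULD_BLOCK = -10 };

/* Stand-in for the event-handler registration done during interlib declare;
 * a Slurm PMIx plugin without event support effectively yields "would block". */
static int register_interlib_events(void)
{
    return RC_WOULD_BLOCK;
}

int main(void)
{
    int rc = register_interlib_events();

    if (rc != RC_SUCCESS && rc != RC_WOULD_BLOCK) {
        fprintf(stderr, "init failed: %d\n", rc);   /* only truly fatal errors abort */
        return 1;
    }
    if (rc == RC_WOULD_BLOCK) {
        printf("host does not support PMIx events; continuing without them\n");
    }
    return 0;
}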

asanchez1987 commented 5 years ago

Last time I attempted to test this was 2 months ago, with this slightly different combination of versions:

Slurm 18.08 + PMIx 3.0 + OMPI 4.0.x

And it worked.

rhc54 commented 5 years ago

Does it still work with that combination, replacing OMPI 4.0.x with the official v4.0.0 release? Just wondering if something got into OMPI at the last minute.

asanchez1987 commented 5 years ago

This fails as well, with the same error as reported.

Slurm 18.08 + PMIx 3.1 + OMPI 4.0.0

Maybe the problem is within PMIx 3.1, since 3.0 worked two months ago.

rhc54 commented 5 years ago

Sounds like it - will have to investigate

asanchez1987 commented 5 years ago

Compiling OMPI 4.0.0 against an external PMIx 3.0.0 errors out with:

ext3x.c: In function ‘ext3x_convert_opalrc’:
ext3x.c:511:16: error: ‘PMIX_OPERATION_SUCCEEDED’ undeclared (first use in this function); did you mean ‘OPAL_OPERATION_SUCCEEDED’?
   return PMIX_OPERATION_SUCCEEDED;
          ^~~~~~~~
          OPAL_OPERATION_SUCCEEDED
ext3x.c:511:16: note: each undeclared identifier is reported only once for each function it appears in
ext3x.c: In function ‘ext3x_convert_rc’:
ext3x.c:607:10: error: ‘PMIX_OPERATION_SUCCEEDED’ undeclared (first use in this function); did you mean ‘OPAL_OPERATION_SUCCEEDED’?
   case PMIX_OPERATION_SUCCEEDED:
        ^~~~~~~~
        OPAL_OPERATION_SUCCEEDED

Perhaps that helps with the problem?
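
For what it's worth, the usual way to keep such a converter compiling against older PMIx headers that lack a newer status code is a preprocessor guard; the following is only an illustrative sketch (with made-up MY_OPAL_* values), not the actual Open MPI ext3x.c:

#include <pmix.h>

/* Illustrative OPAL-side return codes for the sketch. */
#define MY_OPAL_SUCCESS              0
#define MY_OPAL_OPERATION_SUCCEEDED  1
#define MY_OPAL_ERROR               -1

int convert_rc(pmix_status_t rc)
{
    switch (rc) {
    case PMIX_SUCCESS:
        return MY_OPAL_SUCCESS;
#ifdef PMIX_OPERATION_SUCCEEDED   /* only defined by newer PMIx headers, e.g. 3.1.x */
    case PMIX_OPERATION_SUCCEEDED:
        return MY_OPAL_OPERATION_SUCCEEDED;
#endif
    default:
        return MY_OPAL_ERROR;
    }
}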

rhc54 commented 5 years ago

Now I'm a little confused. Your earlier report showed you configuring OMPI v4.0.0 with an external PMIx v3.1 - yes? Didn't you encounter the same errors?

rhc54 commented 5 years ago

I guess what I'm trying to say is that some of these comments seem contradictory.

Can you perhaps clarify a bit?

rhc54 commented 5 years ago

Working my way thru this: I am able to build the head of the v4.0.x branch against the head of the PMIx v3.1 branch and the head of the PMIx v3.0 branch without problems (ignoring all the warnings from the rest of the code).

I don't have a way to test this under Slurm right now, but I will look at the code and see if I can spot a reason for the issue.

asanchez1987 commented 5 years ago

Sorry, I'm somewhat busy these days with customer training and haven't been able to test version combinations further. Probably in a week or so I'll be able to clarify the contradictory comments. Thanks for looking into this.

hppritcha commented 3 years ago

@asanchez1987 is this still important?

asanchez1987 commented 3 years ago

Sorry I didn't come back to this earlier. Please disregard this issue for now. If/when I test all the components (PMIx, Slurm, OpenMPI) again with modern versions and find issues, I'll open a separate issue. In the meantime, we can close this. Thanks.