If it's not too problematic, could you try testing against PMIx 2.1.4? My first guess from the output is that Slurm's PMIx_v3 plugin doesn't support something Open MPI is trying to do when built against PMIx 3.x.
It probably won't work that way either, but it will work with mpirun. The issue is that the Slurm plugin does not include support for PMIx events, and the code in ompi_interlib_declare doesn't properly account for that. It likely wasn't caught pre-release since (a) OMPI does support events when run under mpirun, and (b) we don't consistently test the srun method.
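To illustrate what "properly account for that" would mean, here is a minimal sketch, assuming the PMIx v3 client API (this is not OMPI's actual code, and the function names are made up), of registering an event handler while tolerating a host that lacks event support:

```c
/* Sketch only: register a default PMIx event handler, but treat a host
 * without event support (e.g. the Slurm PMIx plugin) as non-fatal. */
#include <pmix.h>
#include <stdio.h>

/* Invoked when a registered event is delivered. */
static void event_cb(size_t evhdlr_registration_id, pmix_status_t status,
                     const pmix_proc_t *source,
                     pmix_info_t info[], size_t ninfo,
                     pmix_info_t results[], size_t nresults,
                     pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
{
    /* ...react to the event, then tell the library we are done... */
    if (NULL != cbfunc) {
        cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
    }
}

/* Invoked with the result of the registration itself. */
static void reg_cb(pmix_status_t status, size_t evhandler_ref, void *cbdata)
{
    if (PMIX_SUCCESS != status) {
        /* e.g. PMIX_ERR_NOT_SUPPORTED from a host without events:
         * degrade gracefully instead of aborting the job */
        fprintf(stderr, "PMIx events unavailable (%s); continuing without them\n",
                PMIx_Error_string(status));
    }
}

void register_events_if_possible(void)
{
    /* NULL/0 codes => default handler for all events */
    PMIx_Register_event_handler(NULL, 0, NULL, 0, event_cb, reg_cb, NULL);
}
```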
Last time I attempted to test this was 2 months ago, with this slightly different combination of versions:
Slurm 18.08 + PMIx 3.0 + OMPI 4.0.x
And it worked.
Does it still work with that combination, replacing OMPI 4.0.x with the official v4.0.0 release? Just wondering if something got into OMPI at the last minute.
This fails as well, with the same error as reported.
Slurm 18.08 + PMIx 3.1 + OMPI 4.0.0
Maybe the problem is within PMIx 3.1, since 3.0 worked two months ago.
Sounds like it - will have to investigate
Compiling OMPI 4.0.0 against an external PMIx 3.0.0 errors out with:
```
ext3x.c: In function ‘ext3x_convert_opalrc’:
ext3x.c:511:16: error: ‘PMIX_OPERATION_SUCCEEDED’ undeclared (first use in this function); did you mean ‘OPAL_OPERATION_SUCCEEDED’?
     return PMIX_OPERATION_SUCCEEDED;
            ^~~~~~~~
            OPAL_OPERATION_SUCCEEDED
ext3x.c:511:16: note: each undeclared identifier is reported only once for each function it appears in
ext3x.c: In function ‘ext3x_convert_rc’:
ext3x.c:607:10: error: ‘PMIX_OPERATION_SUCCEEDED’ undeclared (first use in this function); did you mean ‘OPAL_OPERATION_SUCCEEDED’?
     case PMIX_OPERATION_SUCCEEDED:
          ^~~~~~~~
          OPAL_OPERATION_SUCCEEDED
```
Perhaps that helps with the problem?
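For what it's worth, since PMIx status codes are preprocessor defines, a hypothetical guard along these lines (a sketch of the technique, not necessarily the real fix) would let the ext3x glue build against headers that predate the constant:

```c
/* Hypothetical sketch, in the context of OMPI's opal/mca/pmix/ext3x/ext3x.c
 * (the OPAL_* constants come from opal/constants.h there).
 * PMIX_OPERATION_SUCCEEDED does not exist in the PMIx 3.0 series,
 * so only translate it when the installed header defines it. */
static int ext3x_convert_rc_sketch(pmix_status_t rc)
{
    switch (rc) {
#ifdef PMIX_OPERATION_SUCCEEDED
    case PMIX_OPERATION_SUCCEEDED:
        return OPAL_OPERATION_SUCCEEDED;
#endif
    case PMIX_SUCCESS:
        return OPAL_SUCCESS;
    default:
        return OPAL_ERROR;
    }
}
```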
Now I'm a little confused. Your earlier report showed you configuring OMPI v4.0.0 with an external PMIx v3.1 - yes? Didn't you encounter the same errors?
I guess what I'm trying to say is that some of these comments seem contradictory:
- You indicate you were able to build against PMIx v3.0 two months ago using OMPI v4.0.x, yet now you indicate that you cannot even compile that combination.
- You indicated that you were able to build and run against PMIx v3.1, getting an error only at runtime; yet that should have failed to build as well if v3.0 won't build.
Can you perhaps clarify a bit?
Working my way through this: I am able to build the head of the v4.0.x branch against the head of the PMIx v3.1 branch and the head of the PMIx v3.0 branch without problems (ignoring all the warnings from the rest of the code).
I don't have a way to test this under Slurm right now, but I will look at the code and see if I can spot a reason for the issue.
Sorry, I'm somewhat busy these days with customer training and haven't been able to test version combinations further. In a week or so I should be able to clarify the contradictory comments. Thanks for looking into this.
@asanchez1987 is this still important?
Sorry I didn't come back to this earlier. Please disregard this issue for now. If/when I test all the components (PMIx, Slurm, Open MPI) again with modern versions and find issues, I'll open a separate issue. In the meantime, we can close this. Thanks.
I'm trying to test Slurm + PMIx + Open MPI with the following software versions and configuration requests:
PMIx 3.1 with HEAD at 6f384bf19ae5a99f885871c2232583efbbbaf1ab
```
$ ../../source/configure --prefix=/home/alex/repos/pmix/install/3.1
```
Slurm 18.08 with HEAD at 3f5c0e58187af8903da0aeb967a45460b0ea4328 (future 18.08.4)
Slurm + PMIx (without Open MPI) seems to work:
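For reference, the kind of commands used to sanity-check the plugin on its own look like this (illustrative, not the exact output elided above):

```
$ srun --mpi=list                 # the pmix plugin should be listed
$ srun --mpi=pmix -N 2 hostname   # launch across the emulated nodes
```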
OpenMPI v4.0.x commit 725f62554e10683dc2620d45b62d642380992516 (HEAD -> v4.0.x, tag: v4.0.0, origin/v4.0.x)
```
$ ../../source/configure --prefix=/home/alex/repos/ompi/install/4.0 --with-pmix=/home/alex/repos/pmix/install/3.1 --with-pmi=/home/alex/slurm/18.08/polaris --with-libevent=/usr --with-hwloc=/usr
```
All software pieces were obtained with git clone. OS: Debian GNU/Linux testing (buster), kernel 4.18.0-2-amd64.
Everything runs on one laptop with Slurm nodes emulated via --enable-multiple-slurmd, so I guess PMIx/OMPI are communicating over shared memory.
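For completeness, this kind of single-host emulation is typically wired up with a slurm.conf fragment like the one below; the node names, ports, and CPU counts here are illustrative, not my actual configuration:

```
# Illustrative fragment: two emulated "nodes" on one host.
# Requires a Slurm build configured with --enable-multiple-slurmd.
NodeName=node1 NodeHostname=laptop NodeAddr=127.0.0.1 Port=17001 CPUs=2
NodeName=node2 NodeHostname=laptop NodeAddr=127.0.0.1 Port=17002 CPUs=2
PartitionName=debug Nodes=node[1-2] Default=YES State=UP
```

Each slurmd daemon is then started under its emulated name, e.g. slurmd -N node1.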
As I said, Slurm + PMIx works well. The problem comes when Open MPI enters the picture:
Please let me know if you need more information.
Thanks so much. Alex