open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Install fail on external PMIx, I get "error: 'PMIX_SESSION_PROVISION' undeclared" #12915

Open DaXor-0 opened 2 weeks ago

DaXor-0 commented 2 weeks ago

For a university project I'm trying to build a Raspberry Pi cluster with Slurm.

I've had quite a few issues trying to run srun with MPI, so I've settled on installing Open MPI from the git repo with external PMIx, hwloc, and libevent for PMIx/Slurm integration.

I'm building Open MPI version 5.1.0a1 on a Raspberry Pi 5 cluster managed with Slurm (the nodes run Raspberry Pi OS Lite).

What I've done so far:

HWLOC (v2.11 git clone):
    ./configure --disable-rsmi --prefix=/hwloc-install-prefix
    make
    make install

LIBEVENT (latest git clone):
    ./configure --prefix=/libevent-install-prefix
    make
    make install

OPENPMIX (latest git clone):
    ./configure --with-slurm --with-libevent=/libevent-install-prefix --with-hwloc=/hwloc-install-prefix --prefix=/pmix-install-prefix
    make
    make install

OPENMPI (latest git clone):
    ./configure --disable-sphinx --with-slurm --with-libevent=libevent-install-prefix --with-hwloc=hwloc-install-prefix --with-pmi=pmix-install-prefix --prefix=ompi-prefix
    make    <-- the build fails here

(Note that I'm disabling Sphinx because I haven't yet installed the required Python module on the cluster.)

The PMIx configure output correctly reports Slurm support and the paths to the external libevent and hwloc. The Open MPI configure output likewise reports pmix, libevent, and hwloc as external.

When I run make for Open MPI, the build fails with this error:

In file included from /clusterfs/apps/openpmix/include/pmix_common.h:2797,
                 from /clusterfs/apps/openpmix/include/pmix/src/class/pmix_list.h:78,
                 from /clusterfs/src/ompi/3rd-party/prrte/src/pmix/pmix-internal.h:26,
                 from prted/pmix/pmix_server_session.c:12:
prted/pmix/pmix_server_session.c: In function 'process_directive':
prted/pmix/pmix_server_session.c:145:50: error: 'PMIX_SESSION_PROVISION' undeclared (first use in this function); did you mean 'PMIX_SESSION_PROVISION_NODES'?
  145 |         } else if (PMIX_CHECK_KEY(&req->info[n], PMIX_SESSION_PROVISION) ||
      |                                                  ^~~~~~~~~~~~~~~~~~~~~~
/clusterfs/apps/openpmix/include/pmix_deprecated.h:497:30: note: in definition of macro 'PMIX_CHECK_KEY'
  497 |     PMIx_Check_key((a)->key, b)
      |                              ^
prted/pmix/pmix_server_session.c:145:50: note: each undeclared identifier is reported only once for each function it appears in
  145 |         } else if (PMIX_CHECK_KEY(&req->info[n], PMIX_SESSION_PROVISION) ||
      |                                                  ^~~~~~~~~~~~~~~~~~~~~~
/clusterfs/apps/openpmix/include/pmix_deprecated.h:497:30: note: in definition of macro 'PMIX_CHECK_KEY'
  497 |     PMIx_Check_key((a)->key, b)
      |                              ^
prted/pmix/pmix_server_session.c: At top level:
prted/pmix/pmix_server_session.c:416:1: fatal error: opening dependency file prted/pmix/.deps/libprrte_la-pmix_server_session.Tpo: Permission denied
  416 | }
      | ^
compilation terminated.
make[4]: *** [Makefile:1655: prted/pmix/libprrte_la-pmix_server_session.lo] Error 1
make[4]: *** Waiting for unfinished jobs....
make[4]: Leaving directory '/clusterfs/src/ompi/3rd-party/prrte/src'
make[3]: *** [Makefile:1862: all-recursive] Error 1
make[3]: Leaving directory '/clusterfs/src/ompi/3rd-party/prrte/src'
make[2]: *** [Makefile:795: all-recursive] Error 1
make[2]: Leaving directory '/clusterfs/src/ompi/3rd-party/prrte'
make[1]: *** [Makefile:1385: all-recursive] Error 1
make[1]: Leaving directory '/clusterfs/src/ompi/3rd-party'
make: *** [Makefile:1512: all-recursive] Error 1

rhc54 commented 2 weeks ago

This option isn't correct: --with-pmi=pmix-install-prefix should be --with-pmix=pmix-install-prefix. The output indicates you picked up some other version of PMIx, one that lacks some of the definitions found in the upstream PMIx master branch.

DaXor-0 commented 2 weeks ago

I get the same error.

This is the configure command I ran:

./configure --with-slurm --disable-sphinx --with-pmix=/clusterfs/apps/openpmix --with-hwloc=/clusterfs/apps/hwloc --with-libevent=/clusterfs/apps/libevent --prefix=/clusterfs/apps/openmpi

And this is the configure output:

Open MPI configuration:
-----------------------
Version: 5.1.0a1
MPI Standard Version: 3.1
Build MPI C bindings: yes
Build MPI Fortran bindings: no
Build MPI Java bindings (experimental): no
Build Open SHMEM support: false (no spml)
Debug build: no
Platform file: (none)

Miscellaneous
-----------------------
Atomics: GCC built-in style atomics
Fault Tolerance support: mpi
HTML docs and man pages: no documentation available
hwloc: external
libevent: external
Open UCC: no
pmix: external
PRRTE: internal
Threading Package: pthreads

Transports
-----------------------
Cisco usNIC: no
Intel Omnipath (PSM2): no (not found)
Open UCX: no
OpenFabrics OFI Libfabric: no (not found)
Portals4: no (not found)
Shared memory/copy in+copy out: yes
Shared memory/Linux CMA: yes
Shared memory/Linux KNEM: no
Shared memory/XPMEM: no
TCP: yes

Accelerators
-----------------------
CUDA support: no
Intel ZE support: no
ROCm support: no

OMPIO File Systems
-----------------------
DDN Infinite Memory Engine: no
Generic Unix FS: yes
IBM Spectrum Scale/GPFS: no (not found)
Lustre: no (not found)

rhc54 commented 2 weeks ago

Afraid I cannot help you much - something is quite wrong here. You should not be able to configure with an external hwloc, libevent, and pmix and then use an internal PRRTE. Configure is supposed to error out on that attempt, since these components must either all be internal or all be external.
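
The mismatch described here can be read straight off the configure summary. The following is a minimal sketch, assuming the summary keeps the "name: internal" / "name: external" line form shown in this thread; the function name check_component_mix is hypothetical and not part of Open MPI.

```shell
#!/bin/sh
# Hypothetical helper (not part of Open MPI): read a configure summary on
# stdin and report whether hwloc, libevent, pmix, and PRRTE are all internal
# or all external. Assumes lines of the form "name: value" as in this thread.
check_component_mix() {
    awk -F': *' '
        $1 == "hwloc" || $1 == "libevent" || $1 == "pmix" || $1 == "PRRTE" {
            seen[$2]++              # count each distinct value (internal/external)
        }
        END {
            n = 0
            for (k in seen) n++     # number of distinct values encountered
            if (n > 1) { print "mixed"; exit 1 }
            print "consistent"
        }
    '
}

# The four relevant lines from the summary posted above.
if ! check_component_mix <<'EOF'
hwloc: external
libevent: external
pmix: external
PRRTE: internal
EOF
then
    echo "summary mixes internal and external components" >&2
fi
```

For the summary in this thread the function prints "mixed", which matches the inconsistency pointed out above.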

Setting that weirdness aside, I can only tell you that you are not in fact building against a head of the PMIx master branch. I don't know if you incorrectly checked out some other branch, or have some older PMIx install on your system, or...? I only know that PRRTE is looking at an old version of PMIx, which is what is causing the error.
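
One way to test the "old PMIx" diagnosis is to check whether the headers under the chosen prefix define the missing key at all. This is a minimal sketch, assuming the key is introduced by a plain #define somewhere under the prefix's include directory; the function name pmix_defines_key and the PMIX_PREFIX variable are hypothetical, and the default prefix is the one from the configure line in this thread.

```shell
#!/bin/sh
# Hypothetical helper: succeed if any header under <prefix>/include contains
# a "#define <key>" line. The trailing group keeps longer names such as
# PMIX_SESSION_PROVISION_NODES from matching PMIX_SESSION_PROVISION.
pmix_defines_key() {
    prefix=$1
    key=$2
    grep -rqE "^[[:space:]]*#[[:space:]]*define[[:space:]]+${key}([[:space:]]|\$)" \
        "${prefix}/include" 2>/dev/null
}

# Assumed default: the --with-pmix prefix used earlier in this thread.
PMIX_PREFIX=${PMIX_PREFIX:-/clusterfs/apps/openpmix}

if pmix_defines_key "$PMIX_PREFIX" PMIX_SESSION_PROVISION; then
    echo "PMIX_SESSION_PROVISION is defined under $PMIX_PREFIX/include"
else
    echo "PMIX_SESSION_PROVISION not found under $PMIX_PREFIX/include"
fi
```

If the key is not found, the headers at that prefix are older than what the bundled PRRTE expects, which is consistent with the compile error above.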

DaXor-0 commented 2 weeks ago

Ok, thanks for the advice.

My hypothesis is that something is breaking because I'm on ARM.

rhc54 commented 2 weeks ago

Doubt that it has anything to do with ARM as many of us (myself included) operate regularly on that hardware. You should check to see if you have another PMIx install somewhere on the system that is causing the confusion. Try building everything internal (instead of using the external libs) and see if that works. Etc.
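
The search for a second PMIx install can be sketched as a scan for stray copies of pmix_common.h, the header named in the compile error. This is only an illustration: the function name find_pmix_headers and the list of search roots are assumptions, and the roots should be adjusted to wherever software gets installed on the cluster.

```shell
#!/bin/sh
# Hypothetical helper: list every pmix_common.h found under the given roots,
# sorted. Errors from unreadable or missing directories are discarded.
find_pmix_headers() {
    find "$@" -type f -name pmix_common.h 2>/dev/null | sort
}

# Assumed search roots: common system prefixes plus the shared app directory
# used in this thread.
find_pmix_headers /usr /usr/local /opt /clusterfs/apps
```

More than one result means the compiler could be picking up a header from a different PMIx than the one passed to --with-pmix.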