open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Internal hwloc shipped with OpenMPI 4.1.7 no longer compatible with SLURM 23.11 cgroup plugin / system hwloc #12470

NicoMittenzwey opened this issue 4 months ago

NicoMittenzwey commented 4 months ago

System

AlmaLinux 9.3
Open MPI 4.1.7 from Nvidia HPC-X 2.18.0
Nvidia InfiniBand NDR
Slurm 23.11

Issue

We are running Slurm 23.11 on AlmaLinux 9.3 with TaskPlugin=task/affinity,task/cgroup and Open MPI 4.1.7 from Mellanox / Nvidia HPC-X 2.18.0. When starting jobs with fewer than the maximum number of processes per node and NOT defining --ntasks-per-node, Open MPI 4.1.7 crashes because it tries to bind processes to cores that are not available to it:

Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        gpu004
  Application name:  ./hpcx
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "2,114"
  Location:          rtc_hwloc.c:382
--------------------------------------------------------------------------
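For reference, a job submission along the following lines reproduces the scenario described above. This is only a sketch: the module name and application are placeholders, and the task count just needs to be lower than the total number of cores on the allocated nodes while --ntasks-per-node stays unset.

#!/bin/bash
#SBATCH --nodes=2
#SBATCH --ntasks=32          # fewer tasks than the nodes have cores
# note: no --ntasks-per-node, which is what exposes the problem
module load hpcx             # placeholder for the site's HPC-X module
mpirun ./my_app              # my_app is a placeholder binary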

Workaround

Recompiling Open MPI and forcing it to use the system hwloc resolves this issue (you might need to dnf install hwloc-devel first):

./configure [...] --with-hwloc=/usr/ && make && make install
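One way to double-check which hwloc a given Open MPI installation picked up is to compare the ompi_info output against the system library (component names vary between versions, so treat this as a sketch):

ompi_info | grep -i hwloc      # shows the hwloc MCA component the build uses
hwloc-info --version           # version of the system-installed hwloc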

bgoglin commented 4 months ago

Might be related to Cgroup v2. This has been supported by hwloc since 2.2 but OMPI 4.1 seems to still have hwloc 2.0.
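As a quick sanity check, the cgroup version in use on a node can be read from the filesystem type mounted at /sys/fs/cgroup (cgroup2fs means cgroup v2, tmpfs means v1):

stat -fc %T /sys/fs/cgroup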

jsquyres commented 4 months ago

It's unlikely that we'll update the hwloc in Open MPI v4.1.x.

Your workaround is fine (use the system hwloc). You might also want to try bumping up to Open MPI v5.0.x (which will use the system-provided hwloc -- if available -- by default).
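For anyone building v5.0.x from source who wants to be explicit about it, the external hwloc can also be requested at configure time; the exact option spelling may differ between versions, so check ./configure --help first:

./configure [...] --with-hwloc=external && make && make install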

NicoMittenzwey commented 4 months ago

Thanks. Yes, we actually installed Open MPI v5.0.2 in parallel as well. However, some applications run significantly faster using HCOLL, but we ran into #10718 with Open MPI v5.

We also try to stick with vendor-optimized environments for support reasons, and Nvidia HPC-X ships with Open MPI 4.1 using the internal hwloc. So this issue also serves as documentation of our findings, in the hope that search engines will index it and others won't have to spend hours tracking down the root cause.

phil-blain commented 2 weeks ago

Not sure if it's exactly the same issue, but I'm hitting the same error on our cluster with this configuration:

If there are other jobs running on the node I end up on, I get:

--------------------------------------------------------------------------
Open MPI tried to bind a new process, but something went wrong.  The
process was killed without launching the target application.  Your job
will now abort.

  Local host:        <redacted>
  Application name:  /usr/bin/hostname
  Error message:     hwloc_set_cpubind returned "Error" for bitmap "0"
  Location:          rtc_hwloc.c:382
--------------------------------------------------------------------------

This happens even if I use --ntasks-per-node.

A workaround is to use srun instead of mpirun, or use mpirun --bind-to none.
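For example (whether the direct srun launch works depends on how Slurm and PMIx were built, so take the exact flags as an assumption):

srun --mpi=pmix ./my_app            # launch through Slurm directly
mpirun --bind-to none ./my_app      # or tell mpirun not to bind processes at all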