open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

5.0.4 and newer -- LSF Affinity hostfile bug #12794

zerothi opened this issue 1 week ago

zerothi commented 1 week ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I am testing this with 5.0.3 vs. 5.0.5 (only 5.0.5 has this problem). I don't have 5.0.4 installed, so I haven't verified it there, but I am quite confident the problem also occurs in 5.0.4, since the PRRTE submodule is the same for 5.0.4 and 5.0.5.

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source. Some relevant ompi_info -c output:

 Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-lsf=/lsf/10.1'
                          '--with-lsf-libdir=/lsf/10.1/linux3.10-glibc2.17-x86_64/lib'
                          '--without-tm' '--enable-mpi-fortran=all'
                          '--with-hwloc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--enable-orterun-prefix-by-default'
                          '--with-ucx=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-ucc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-knem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--without-verbs' 'FCFLAGS=-O3 -march=haswell
                          -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32' 'CFLAGS=-O3
                          -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          'CXXFLAGS=-O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          '--with-ofi=no'
                          '--with-libevent=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          'LDFLAGS=-L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt'
                          '--with-xpmem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'

And the build flags:

            Build CFLAGS: -DNDEBUG -O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
                          -finline-functions
           Build FCFLAGS: -O3 -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
           Build LDFLAGS: -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
              Build LIBS: -levent_core -levent_pthreads -lhwloc
                          /tmp/sebo3-gcc-13.3.0-binutils-2.42/openmpi-5.0.3/3rd-party/openpmix/src/libpmix.la

The version numbers are of course different for the 5.0.5 build; otherwise the configuration is the same.

Please describe the system on which you are running


Details of the problem

The problem relates to the interaction between LSF and Open MPI.

A couple of issues are shown here.

Bug introduced between 5.0.3 and 5.0.5

I encounter problems running simple programs (hello-world) in a multinode configuration:

$> bsub -n 8 -R "span[ptile=2]" ... < run.bsub

$> cat run.bsub
...
mpirun --report-bindings a.out

This will run on 4 nodes, each using 2 cores.
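
For reference, a minimal MPI hello-world of the kind used as a.out here might look like the following (the exact source is not included in the report; this sketch just prints rank, size, and host name, which is enough to check placement):

/* hello.c -- hypothetical stand-in for the a.out used above.
 * Build with: mpicc hello.c -o a.out */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);
    MPI_Get_processor_name(host, &len);

    /* With -n 8 and span[ptile=2], ranks 0-7 should report 4 distinct hosts. */
    printf("Hello from rank %d of %d on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}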

Output from:

So the above indicates a regression in this handling. I tried to track it down in PRRTE, but I am not familiar enough with the logic there.

I tracked the submodule hashes of Open MPI between 5.0.3 and 5.0.4 to these:

So my suspicion is that 5.0.4 also has this problem.

Now, these things are relatively easy to work around.

I just do:

unset LSB_AFFINITY_HOSTFILE

and rely on cgroups. Then I get the correct behaviour (correct bindings, etc.).

By unsetting it, I also fall back to the default Open MPI binding:

Nodes with HW threads

This is likely related to the above; I just put it here for completeness.

As mentioned above I can do unset LSB_AFFINITY_HOSTFILE and get correct bindings.

However, the above only works when there are no hardware threads (HWT).

Here is the same thing for a node with 2 HWT/core (EPYC Milan, 32 cores/socket, 2 sockets):

Only requesting 4 cores here.

If you need more information, let me know!

rhc54 commented 1 week ago

Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting LSF stuff, and being honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - might be that LSF support is coming to an end, or is at least somewhat modified/limited (perhaps need to rely solely on rankfile and use of revised cmd line options - may not be able to utilize LSF "integration").

Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in hwloc XML format. It will likely take a while to get a fix, as support time is quite limited.

zerothi commented 1 week ago

As for LSF support... damn...

As for the rankfile mapping: I got it through:

hwloc-gather-topology test

I have never done that before, so let me know if that is correct?

(I couldn't upload XML files directly, so it had to be compressed.) test.xml.gz

rhc54 commented 1 week ago

As for LSF support... damn...

Best I can suggest is that you contact IBM through your LSF contract support and point out that if they want OMPI to function on LSF going forward, they probably need to put a little effort into supporting it. 🤷‍♂️

XML looks fine - thanks! Will update as things progress.

sb22bs commented 1 week ago

prrte @ 42169d1cebf75318ced0306172d3a452ece13352 is the last good one; prrte @ f297a9e2eb96c2db9d7756853f56315ea5a127cd seems to break it (at least in our setup).

sb22bs commented 1 week ago

workaround: export HWLOC_ALLOW=all :-)

rhc54 commented 1 week ago

Ouch - I would definitely advise against doing so. It might work for a particular application, but almost certainly will cause breakage in general.

fabiosanger commented 6 days ago

Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting LSF stuff, and being honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - might be that LSF support is coming to an end, or is at least somewhat modified/limited (perhaps need to rely solely on rankfile and use of revised cmd line options - may not be able to utilize LSF "integration").

Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in hwloc XML format. It will likely take a while to get a fix, as support time is quite limited.

But the documentation still suggests that Open MPI can be built with LSF support.

rhc54 commented 6 days ago

But the documentation still suggests that Open MPI can be built with LSF support.

Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.

fabiosanger commented 6 days ago

prrte @ 42169d1cebf75318ced0306172d3a452ece13352

Which Open MPI release?

fabiosanger commented 6 days ago

But the documentation still suggests that Open MPI can be built with LSF support.

Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.

thank you

rhc54 commented 6 days ago

prrte @ 42169d1cebf75318ced0306172d3a452ece13352

Which Open MPI release?

They seem to indicate that v5.0.3 is working, but all the v5.0.x appear to at least build for them.

zerothi commented 6 days ago

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended :cry:

fabiosanger commented 6 days ago

I tried all the 5.0.x versions, but they won't get past configure. I managed to build it with 4.0.3:

./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda

fabiosanger commented 6 days ago

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

I use the tarball to build; could that be the problem?

zerothi commented 6 days ago

I tried all the 5.0.x versions, but they won't get past configure. I managed to build it with 4.0.3:

./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda

You could check whether our configure step has something important that yours is missing (see the initial message).

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

I use the tarball to build; could that be the problem?

I don't know what the issue could be, but I think it shouldn't clutter this thread; rather, open a new issue, IMHO. This issue is deeper (not a build issue).

fabiosanger commented 6 days ago

I did open a ticket

rhc54 commented 6 days ago

workaround: export HWLOC_ALLOW=all :-)

@bgoglin This implies that the envvar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.

bgoglin commented 6 days ago

workaround: export HWLOC_ALLOW=all :-)

@bgoglin This implies that the envvar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.

This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).
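
To illustrate the point (a minimal sketch, not taken from hwloc or PRRTE sources): with a default topology load, the allowed cpuset is a subset of the complete cpuset when cgroups restrict the process, whereas setting HWLOC_ALLOW=all makes hwloc treat everything as allowed, so the two sets coincide.

/* allowed_check.c -- hypothetical diagnostic, not part of hwloc or PRRTE.
 * Prints the complete vs. allowed cpusets so the effect of cgroups
 * (and of HWLOC_ALLOW=all) can be observed.
 * Build with: cc allowed_check.c -lhwloc */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    char *complete_str, *allowed_str;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);   /* default flags */

    hwloc_bitmap_asprintf(&complete_str,
                          hwloc_topology_get_complete_cpuset(topo));
    hwloc_bitmap_asprintf(&allowed_str,
                          hwloc_topology_get_allowed_cpuset(topo));

    /* Under a cgroup restriction the allowed set is smaller than the
     * complete set; with HWLOC_ALLOW=all the two are expected to match. */
    printf("complete cpuset: %s\n", complete_str);
    printf("allowed  cpuset: %s\n", allowed_str);

    free(complete_str);
    free(allowed_str);
    hwloc_topology_destroy(topo);
    return 0;
}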

rhc54 commented 5 days ago

This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).

@bgoglin Hmmm...we removed this code from PRRTE:

        flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED;

because we only want the topology to contain the CPUs the user is allowed to use (note: all CPUs will still be in the complete_cpuset field if we need them - we use the return from hwloc_topology_get_allowed_cpuset). If the topology includes all CPUs (which is what happens when we include the above line of code), then we wind up thinking we can use them, which messes up the mapping/binding algorithm. So what I need is a way to prevent the user from overriding that requirement by setting this envvar. It might help a particular user in a specific situation, but it more generally causes problems.

I'll work out the issue for LSF as a separate problem - we don't see problems elsewhere, so it has something to do with what LSF is doing. My question for you is: how do I ensure the cpuset returned by get_allowed_cpuset only contains allowed CPUs, which is what PRRTE needs?
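
For context, the pattern described above looks roughly like this (a simplified sketch, not the actual PRRTE code): load the topology with default flags, i.e. without INCLUDE_DISALLOWED, and drive the mapping only over the CPUs in hwloc_topology_get_allowed_cpuset().

/* prrte_style_mapping.c -- simplified sketch of the pattern described above;
 * not actual PRRTE code. Build with: cc prrte_style_mapping.c -lhwloc */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_const_cpuset_t allowed;
    unsigned id;

    hwloc_topology_init(&topo);
    /* HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED is intentionally NOT set, so the
     * main topology only describes CPUs this process is allowed to use. */
    hwloc_topology_load(topo);

    allowed = hwloc_topology_get_allowed_cpuset(topo);

    /* Map/bind only over the allowed PUs. */
    hwloc_bitmap_foreach_begin(id, allowed)
        printf("PU #%u is available for mapping\n", id);
    hwloc_bitmap_foreach_end();

    hwloc_topology_destroy(topo);
    return 0;
}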

bgoglin commented 5 days ago

Just ignore this corner case. @sb22bs said using this envvar is a workaround. It was designed for strange buggy cases, e.g. when cgroups are misconfigured. I can try to better document that this envvar is a bad idea unless you really know what you are doing. Just consider that get_allowed_cpuset() is always correct.