open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

5.0.4 and newer -- LSF Affinity hostfile bug #12794

Closed. zerothi closed this issue 1 week ago.

zerothi commented 2 months ago

Background information

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

I am testing this on 5.0.3 vs. 5.0.5 (only 5.0.5 has this problem). I don't have 5.0.4 installed, so I don't know if that is affected, but I am quite confident that this also occurs for 5.0.4 (since the submodule for prrte is the same for 5.0.4 and 5.0.5).

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source. A bit of ompi_info -c output:

 Configure command line: 'CC=gcc' 'CXX=g++' 'FC=gfortran'
                          '--prefix=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-lsf=/lsf/10.1'
                          '--with-lsf-libdir=/lsf/10.1/linux3.10-glibc2.17-x86_64/lib'
                          '--without-tm' '--enable-mpi-fortran=all'
                          '--with-hwloc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--enable-orterun-prefix-by-default'
                          '--with-ucx=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-ucc=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--with-knem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          '--without-verbs' 'FCFLAGS=-O3 -march=haswell
                          -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32' 'CFLAGS=-O3
                          -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          'CXXFLAGS=-O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32'
                          '--with-ofi=no'
                          '--with-libevent=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'
                          'LDFLAGS=-L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt'
                          '--with-xpmem=/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92'

And env-vars:

            Build CFLAGS: -DNDEBUG -O3 -march=haswell -mtune=haswell -mavx2
                          -m64  -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
                          -finline-functions
           Build FCFLAGS: -O3 -march=haswell -mtune=haswell -mavx2 -m64 
                          -Wl,-z,max-page-size=0x1000 -O3
                          -Wa,-mbranches-within-32B-boundaries
                          -falign-functions=32 -falign-loops=32
           Build LDFLAGS: -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -Wl,-rpath,/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -Wl,-rpath,/lsf/10.1/linux3.10-glibc2.17-x86_64/lib
                          -lucp  -levent -lhwloc -latomic -llsf -lm -lpthread
                          -lnsl -lrt
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
                          -L/appl9/gcc/13.3.0-binutils-2.42/openmpi/5.0.3-lsf10-alma92/lib
              Build LIBS: -levent_core -levent_pthreads -lhwloc
                          /tmp/sebo3-gcc-13.3.0-binutils-2.42/openmpi-5.0.3/3rd-party/openpmix/src/libpmix.la

Version numbers are of course different for 5.0.5, otherwise the same.

Please describe the system on which you are running


Details of the problem

The problem relates to the interaction between LSF and OpenMPI.

A couple of issues are shown here.

Bug introduced between 5.0.3 and 5.0.5

I encounter problems running simple programs (hello-world) in a multinode configuration:

$> bsub -n 8 -R "span[ptile=2]" ... < run.bsub

$> cat run.bsub
...
mpirun --report-bindings a.out

This will run on 4 nodes, each using 2 cores.

Output from:

So the above indicates some regression in this handling. I tried to backtrack it in prrte, but I am not skilled enough to follow the logic happening there.

I tracked the submodule hashes of OpenMPI between 5.0.3 and 5.0.4 to these:

So my suspicion is that 5.0.4 also has this.

Now, these things are relatively easy to work around.

I just do:

unset LSB_AFFINITY_HOSTFILE

and rely on cgroups. Then I get the correct behaviour. Correct bindings etc.

By unsetting it, I also fall back to the default Open MPI binding:

Nodes with HW threads

This is likely related to the above; I just put it here for completeness.

As mentioned above I can do unset LSB_AFFINITY_HOSTFILE and get correct bindings.

However, the above only works when there are no HWTs (hardware threads).

Here is the same thing for a node with 2 HWTs per core (EPYC Milan, 32 cores per socket, 2 sockets).

Only requesting 4 cores here.

If you need more information, let me know!

rhc54 commented 2 months ago

Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting the LSF code and, to be honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - it might be that LSF support is coming to an end, or at least becomes somewhat modified/limited (perhaps needing to rely solely on a rankfile and the revised cmd line options - it may not be possible to utilize the LSF "integration").

Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in HWLOC XML format. It will likely take a while to get a fix, as support time is quite limited.

zerothi commented 2 months ago

As for LSF support... damn...

As for rankfile mapping. I got it through:

hwloc-gather-topology test

I have never done that before, so let me know if that is correct?

(couldn't upload xml files, had to be compressed). test.xml.gz

rhc54 commented 2 months ago

As for LSF support... damn...

Best I can suggest is that you contact IBM through your LSF contract support and point out that if they want OMPI to function on LSF going forward, they probably need to put a little effort into supporting it. 🤷‍♂️

XML looks fine - thanks! Will update as things progress.

sb22bs commented 2 months ago

prrte @ 42169d1cebf75318ced0306172d3a452ece13352 is the last good one, prrte @ f297a9e2eb96c2db9d7756853f56315ea5a127cd seems to break it (at least in our setup).

sb22bs commented 2 months ago

workaround: export HWLOC_ALLOW=all :-)

rhc54 commented 2 months ago

Ouch - I would definitely advise against doing so. It might work for a particular application, but almost certainly will cause breakage in general.

fabiosanger commented 2 months ago

Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting the LSF code and, to be honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - it might be that LSF support is coming to an end, or at least becomes somewhat modified/limited (perhaps needing to rely solely on a rankfile and the revised cmd line options - it may not be possible to utilize the LSF "integration").

Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in HWLOC XML format. It will likely take a while to get a fix, as support time is quite limited.

but the documentation is still suggesting that open MPI could be built with lsf support

rhc54 commented 2 months ago

but the documentation is still suggesting that open MPI could be built with lsf support

Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.

fabiosanger commented 2 months ago

prrte @ 42169d1cebf75318ced0306172d3a452ece13352

which openmpi release?

fabiosanger commented 2 months ago

but the documentation is still suggesting that open MPI could be built with lsf support

Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.

thank you

rhc54 commented 2 months ago

prrte @ 42169d1cebf75318ced0306172d3a452ece13352

which openmpi release?

They seem to indicate that v5.0.3 is working, but all the v5.0.x appear to at least build for them.

zerothi commented 2 months ago

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended :cry:

fabiosanger commented 2 months ago

I tried all 5.0.x versions, but they won't pass configure. I managed to build it with 4.0.3.

./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda

fabiosanger commented 2 months ago

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

I use a tarball to build; could that be the problem?

zerothi commented 2 months ago

I tried all 5.0.x versions, but they won't pass configure. I managed to build it with 4.0.3.

./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda

You could check whether there is anything important in our configure step (see the initial message) that you are missing.

Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended 😢

I use a tarball to build; could that be the problem?

I don't know what the issue could be. But I think it shouldn't clutter this thread; rather, open a new issue, IMHO. This issue is deeper (not a build issue).

fabiosanger commented 2 months ago

I did open a ticket

rhc54 commented 2 months ago

workaround: export HWLOC_ALLOW=all :-)

@bgoglin This implies that the envar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.

bgoglin commented 2 months ago

workaround: export HWLOC_ALLOW=all :-)

@bgoglin This implies that the envar is somehow overriding the flags we pass into the topology discovery API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.

This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).

rhc54 commented 2 months ago

This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).

@bgoglin Hmmm...we removed this code from PRRTE:

        flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED;

because we only want the topology to contain the CPUs the user is allowed to use (note: all CPUs will still be in the complete_cpuset field if we need them - we use the return from hwloc_topology_get_allowed_cpuset). If the topology includes all CPUs (which is what happens when we include the above line of code), then we wind up thinking we can use them, which messes up the mapping/binding algorithm. So what I need is a way of not allowing the user to override that requirement by setting this envar. Might help a particular user in a specific situation, but more generally causes problems.

I'll work out the issue for LSF as a separate problem - we don't see problems elsewhere, so it has something to do with what LSF is doing. My question for you is: how do I ensure the cpuset returned by get_allowed_cpuset only contains allowed CPUs, which is what PRRTE needs?
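For context, here is a minimal standalone hwloc sketch (illustrative only, not PRRTE code) that prints the two cpusets in question. Under a cgroup restriction the allowed set is a strict subset of the complete set, while HWLOC_ALLOW=all makes hwloc treat everything as allowed - which is exactly what the mapper must not assume. Build with something like cc cpusets.c -lhwloc.

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    char *complete = NULL, *allowed = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* complete_cpuset keeps every PU, even disallowed ones; the allowed
       cpuset is what a cgroup-restricted job may actually use */
    hwloc_bitmap_asprintf(&complete, hwloc_topology_get_complete_cpuset(topo));
    hwloc_bitmap_asprintf(&allowed, hwloc_topology_get_allowed_cpuset(topo));
    printf("complete cpuset: %s\n", complete);
    printf("allowed  cpuset: %s\n", allowed);

    free(complete);
    free(allowed);
    hwloc_topology_destroy(topo);
    return 0;
}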

bgoglin commented 2 months ago

Just ignore this corner case. @sb22bs said using this envvar is a workaround. It was designed for strange buggy cases, e.g. when cgroups are misconfigured. I can try to better document that this envvar is a bad idea unless you really know what you are doing. Just consider that get_allowed_cpuset() is always correct.

rhc54 commented 2 weeks ago

Not sure I can do much with this one - no access to an LSF machine, so all I can do is poke around a bit. Using the head of PRRTE's master branch along with your topology, I feed it the following LSB hostfile per one of your comments above:

n-62-12-14 6,70
n-62-12-14 7,71
n-62-12-15 6,70
n-62-12-15 7,71

and I get the following corresponding rank file generated:

rank 0=n-62-12-14 slot=12,13
rank 1=n-62-12-14 slot=14,15
rank 2=n-62-12-15 slot=12,13
rank 3=n-62-12-15 slot=14,15

with everything expressed in HWTs. I have no idea if that is what you expected/wanted. It certainly didn't segfault, but I confess I'm getting confused by all the various scenarios being covered here, so I'm not sure what you did to get a segfault and/or whether it is still seen.
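As an aside, here is a hypothetical sketch (not the actual PRRTE implementation) of the conversion being described, using only the local topology: each affinity-file line gives a host plus comma-separated physical (OS) processor indices, and each index is translated to hwloc's logical index for the slot list. A NULL lookup result is reported instead of dereferenced, which is the failure mode discussed further down this thread.

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* translate one "host cpu,cpu,..." line into a "rank R=host slot=..." line */
static int convert_line(hwloc_topology_t topo, int rank,
                        const char *host, char *cpulist)
{
    char *save = NULL, *tok;
    int first = 1;

    printf("rank %d=%s slot=", rank, host);
    for (tok = strtok_r(cpulist, ",", &save); NULL != tok;
         tok = strtok_r(NULL, ",", &save)) {
        /* look the physical (OS) index up in the topology */
        hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, (unsigned) atoi(tok));
        if (NULL == pu) {
            /* the index does not exist in this topology: report it
               instead of dereferencing a NULL object */
            fprintf(stderr, "\nOS index %s not found in topology\n", tok);
            return -1;
        }
        printf("%s%u", first ? "" : ",", pu->logical_index);
        first = 0;
    }
    printf("\n");
    return 0;
}

int main(void)
{
    hwloc_topology_t topo;
    char cpus[] = "6,70";   /* sample values from the hostfile above */

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);
    convert_line(topo, 0, "n-62-12-14", cpus);
    hwloc_topology_destroy(topo);
    return 0;
}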

zerothi commented 2 weeks ago

Not sure I can do much with this one - no access to an LSF machine, so all I can do is poke around a bit. Using the head of PRRTE's master branch along with your topology, I feed it the following LSB hostfile per one of your comments above: ...

with everything expressed in HWTs. I have no idea if that is what you expected/wanted. It certainly didn't segfault, but I confess I'm getting confused by all the various scenarios being covered here, so I'm not sure what you did to get a segfault and/or whether it is still seen.

Seems to look ok; I'll at least report back if things get solved ;) But for sure, I can understand these things are hard to test without a backing LSF machine... :(

rhc54 commented 2 weeks ago

Give PRRTE master branch a try - it might well be that the problem has been fixed, but that it didn't make its way into an OMPI release yet.

sb22bs commented 2 weeks ago

Hi

Somehow it crashed... but I couldn't get a stack trace. I then added --enable-debug, and the compiler (gcc-14.2) throws an error:


Making all in mca/plm
make[2]: Entering directory '/tmp/xyz/prrte/src/mca/plm'
  CC       base/plm_base_frame.lo
  CC       base/plm_base_select.lo
  CC       base/plm_base_receive.lo
  CC       base/plm_base_launch_support.lo
  CC       base/plm_base_jobid.lo
  CC       base/plm_base_prted_cmds.lo
base/plm_base_launch_support.c: In function ‘prte_plm_base_daemon_callback’:
base/plm_base_launch_support.c:1656:42: error: ‘t’ may be used uninitialized [-Werror=maybe-uninitialized]
 1656 |                     dptr->node->topology = t;
      |                     ~~~~~~~~~~~~~~~~~~~~~^~~
base/plm_base_launch_support.c:1311:22: note: ‘t’ was declared here
 1311 |     prte_topology_t *t, *mytopo;
      |                      ^
rhc54 commented 2 weeks ago

Not sure the compiler is right on that one, but I updated the code just in case - pull the repo and it should be fixed.
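For reference, the usual way to quiet a spurious -Wmaybe-uninitialized of this kind is to initialize the pointers at their declaration; whether this matches the actual change that was pushed is not shown in this thread:

-    prte_topology_t *t, *mytopo;
+    prte_topology_t *t = NULL, *mytopo = NULL;    /* quiets gcc's -Wmaybe-uninitialized */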

sb22bs commented 2 weeks ago

Versions - just for reference:

openpmix 92d3473450d3cf9019ef5951e1cc3a1322feb804
prrte d5e580a0fe2c4cf893da0cc820fe6d188c2c6069
openmpi-5.0.5
$ gdb /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte/bin/prterun core.3230045
GNU gdb (GDB) 14.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
    <http://www.gnu.org/software/gdb/documentation/>.

For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte/bin/prterun...

warning: core file may not match specified executable file.
[New LWP 3230045]
[New LWP 3230047]
[New LWP 3230048]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte/bin/prter'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0  0x00007f5677b32e1e in prte_rmaps_rf_lsf_convert_affinity_to_rankfile (affinity_file=affinity_file@entry=0x7fff3b27f314 "/zhome/31/b/80425/.lsbatch/1729938715.22929140.hostAffinityFile", 
    aff_rankfile=aff_rankfile@entry=0x7fff3b27a0f8) at rmaps_rank_file.c:847
847             sprintf(cpus[i], "%d", obj->logical_index);
[Current thread is 1 (Thread 0x7f5677442b80 (LWP 3230045))]
(gdb) info stack
#0  0x00007f5677b32e1e in prte_rmaps_rf_lsf_convert_affinity_to_rankfile (affinity_file=affinity_file@entry=0x7fff3b27f314 "/zhome/31/b/80425/.lsbatch/1729938715.22929140.hostAffinityFile", 
    aff_rankfile=aff_rankfile@entry=0x7fff3b27a0f8) at rmaps_rank_file.c:847
#1  0x00007f5677b328b3 in prte_rmaps_rf_process_lsf_affinity_hostfile (jdata=jdata@entry=0x1d7ee7a0, options=options@entry=0x7fff3b27a7e0, 
    affinity_file=affinity_file@entry=0x7fff3b27f314 "/zhome/31/b/80425/.lsbatch/1729938715.22929140.hostAffinityFile") at rmaps_rank_file.c:738
#2  0x00007f5677b2f392 in prte_rmaps_rf_map (jdata=0x1d7ee7a0, options=0x7fff3b27a7e0) at rmaps_rank_file.c:137
#3  0x00007f5677b1928c in prte_rmaps_base_map_job (fd=-1, args=<optimized out>, cbdata=0x1d7f1bd0) at base/rmaps_base_map_job.c:837
#4  0x00007f5677681350 in event_process_active_single_queue (base=base@entry=0x1d578940, activeq=0x1d578d90, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
    at event.c:1691
#5  0x00007f56776817d9 in event_process_active (base=base@entry=0x1d578940) at event.c:1783
#6  0x00007f5677681a72 in event_base_loop (base=0x1d578940, flags=flags@entry=1) at event.c:2006
#7  0x0000000000409aca in main (argc=4, argv=0x7fff3b27b658) at prte.c:1185
(gdb) 

This is the content of $LSB_AFFINITY_HOSTFILE

hpc-node(XeonGold6126):n-62-30-30(sebo) $ cat ...../1729938715.22929140.hostAffinityFile
n-62-31-13 16
n-62-31-13 17
n-62-31-13 18
n-62-31-13 19
n-62-31-15 8
n-62-31-15 9
n-62-31-15 10
n-62-31-15 11
n-62-31-8 0
n-62-31-8 1
n-62-31-8 2
n-62-31-8 3
n-62-31-17 0
n-62-31-17 1
n-62-31-17 2
n-62-31-17 3

And this is the binding LSF is reporting via bjobs -l -aff:


 AFFINITY:
                     CPU BINDING                          MEMORY BINDING
                     ------------------------             --------------------
 HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
 n-62-31-13          core   -      -      /1/1/6          -     -    -
 n-62-31-13          core   -      -      /1/1/8          -     -    -
 n-62-31-13          core   -      -      /1/1/9          -     -    -
 n-62-31-13          core   -      -      /1/1/10         -     -    -
 n-62-31-15          core   -      -      /0/0/10         -     -    -
 n-62-31-15          core   -      -      /0/0/11         -     -    -
 n-62-31-15          core   -      -      /0/0/12         -     -    -
 n-62-31-15          core   -      -      /0/0/14         -     -    -
 n-62-31-8           core   -      -      /0/0/0          -     -    -
 n-62-31-8           core   -      -      /0/0/1          -     -    -
 n-62-31-8           core   -      -      /0/0/3          -     -    -
 n-62-31-8           core   -      -      /0/0/4          -     -    -
 n-62-31-17          core   -      -      /0/0/0          -     -    -
 n-62-31-17          core   -      -      /0/0/2          -     -    -
 n-62-31-17          core   -      -      /0/0/3          -     -    -
 n-62-31-17          core   -      -      /0/0/4          -     -    -

The case where it doesn't crash is when the cores start from "0".

$LSB_AFFINITY_HOSTFILE:

n-62-12-71 0
n-62-12-71 1
n-62-12-71 2
n-62-12-71 3
n-62-12-72 0
n-62-12-72 1
n-62-12-72 2
n-62-12-72 3
n-62-12-73 0
n-62-12-73 1
n-62-12-73 2
n-62-12-73 3
n-62-12-74 0
n-62-12-74 1
n-62-12-74 2
n-62-12-74 3

And this is what LSF is "saying" regarding binding:

AFFINITY:
                    CPU BINDING                          MEMORY BINDING
                    ------------------------             --------------------
HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
n-62-12-71          core   -      -      /0/0/0          -     -    -
n-62-12-71          core   -      -      /0/0/1          -     -    -
n-62-12-71          core   -      -      /0/0/2          -     -    -
n-62-12-71          core   -      -      /0/0/3          -     -    -
n-62-12-72          core   -      -      /0/0/0          -     -    -
n-62-12-72          core   -      -      /0/0/1          -     -    -
n-62-12-72          core   -      -      /0/0/2          -     -    -
n-62-12-72          core   -      -      /0/0/3          -     -    -
n-62-12-73          core   -      -      /0/0/0          -     -    -
n-62-12-73          core   -      -      /0/0/1          -     -    -
n-62-12-73          core   -      -      /0/0/2          -     -    -
n-62-12-73          core   -      -      /0/0/3          -     -    -
n-62-12-74          core   -      -      /0/0/0          -     -    -
n-62-12-74          core   -      -      /0/0/1          -     -    -
n-62-12-74          core   -      -      /0/0/2          -     -    -
n-62-12-74          core   -      -      /0/0/3          -     -    -

I think it's just that when prrte does not respect the LSF binding, it crashes... (unless one is setting this nice HWLOC environment variable).

rhc54 commented 2 weeks ago

Looks like LSF is counting object indices in some strange way - when we ask for the object of the given number, we get a NULL return indicating that it doesn't exist. I'm guessing that the LSF index is based on including objects that are not available to the user. Sounds like a bug on their side to me. Regardless, there isn't anything we can do with it, so we'll just have to declare LSF unsupported. I can add some protection so we print an error message instead of segfaulting.

Setting that envar will break a number of other things, but so long as you don't encounter them, you might be able to operate. Probably going to be hit/miss, though.

rhc54 commented 2 weeks ago

Suppose one thing you could check - we are expecting those affinity file values to be physical indices. Maybe LSF has changed to providing logical indices? Just grasping at straws here, but it would explain why we aren't finding that object. Otherwise, a little hard to understand why we aren't finding it in the topology - the os_index (physical) shouldn't depend on the available processors.
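To make the physical-vs-logical distinction concrete, here is a small standalone hwloc sketch (illustrative only) that tries the same index both ways; if LSF switched meanings, the physical lookup can return NULL even though the logical lookup succeeds:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    /* take the index to test from the command line, default to 20 */
    unsigned idx = (argc > 1) ? (unsigned) atoi(argv[1]) : 20;
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* interpret idx as a physical (OS) index ... */
    hwloc_obj_t by_os = hwloc_get_pu_obj_by_os_index(topo, idx);
    /* ... and as a logical index */
    hwloc_obj_t by_logical = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, idx);

    printf("index %u as physical: %s, as logical: %s\n", idx,
           by_os ? "found" : "NULL", by_logical ? "found" : "NULL");

    hwloc_topology_destroy(topo);
    return 0;
}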

@bgoglin Any thoughts on what could be the problem? Debating about removing all LSF-related code, but hate to do so if this is something somebody with access might be able to track down and fix.

bgoglin commented 2 weeks ago

I don't understand what the LSF outputs above mean. Is it possible to launch a parallel job with LSF without mpirun and look at where it binds processes? Something like mpirun -np 20 sleep 1000, but with LSF's own launcher (whatever that is) instead of mpirun.

sb22bs commented 2 weeks ago

Just an example. A long time ago we used Torque/Moab, and there we also shared nodes between different jobs. So if you have a machine with 32 cores, then 24 cores might be involved in some MPI job spanning multiple machines, and then maybe there is another 4-core job and one single-core job. We are playing the same kind of game with LSF. I guess one can also do this with SLURM.

Okay...let's make an example.

Asking for 16 processes, distributed across 4 machines with 4 processes on each machine:

#BSUB -n 16
#BSUB -R "span[ptile=4]"

LSF is handing out "physical" cores, and these you can then see with "bjobs -l -aff". So in the following example the output is as follows (binding to cores only, ignoring memory affinity).

(This is running on older Skylakes, so they are dual-socket with 12 cores per socket.)

The LSF-internal CPU numbering is a bit weird, but they are also using hwloc somewhere in the background, so it seems to really be the core numbering on the physical CPU die.

Showing this now only for the first machine of this MPI job:

 $ bhosts -aff  n-62-31-15
Host[377G] n-62-31-15
    Socket0
        NUMA[0: 0M / 188.1G]
            core0(*0)
            core2(*1)
            core3(*2)
            core4(*3)
            core5(*4)
            core6(*5)
            core8(*6)
            core9(*7)
            core10(*8)
            core11(*9)
            core12(*10)
            core14(*11)
    Socket1
        NUMA[1: 0M / 188.9G]
            core0(*12)
            core1(*13)
            core2(*14)
            core4(*15)
            core5(*16)
            core6(*17)
            core8(*18)
            core9(*19)
            core10(20)
            core11(21)
            core13(22)
            core14(23)

LSF always starts counting with logical core "0" (the numbers in parentheses above), but the first core on the physical CPU die doesn't have to be core 0.

So in LSF terms, the last physical core on the second socket (on the node n-62-31-15) is "1/1/14".

Just to make it complete - here is the numactl --hardware output:


 $ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 63527 MB
node 0 free: 52405 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 64457 MB
node 1 free: 50947 MB
node distances:
node   0   1
  0:  10  21
  1:  21  10

And just take the IDS column below as "socket/socket/core-number" in our case here:


AFFINITY:
                    CPU BINDING                          MEMORY BINDING
                    ------------------------             --------------------
HOST                TYPE   LEVEL  EXCL   IDS             POL   NUMA SIZE
n-62-31-15          core   -      -      /1/1/10         -     -    -
n-62-31-15          core   -      -      /1/1/11         -     -    -
n-62-31-15          core   -      -      /1/1/13         -     -    -
n-62-31-15          core   -      -      /1/1/14         -     -    -
n-62-31-16          core   -      -      /1/1/10         -     -    -
n-62-31-16          core   -      -      /1/1/12         -     -    -
n-62-31-16          core   -      -      /1/1/13         -     -    -
n-62-31-16          core   -      -      /1/1/14         -     -    -
n-62-31-7           core   -      -      /1/1/9          -     -    -
n-62-31-7           core   -      -      /1/1/11         -     -    -
n-62-31-7           core   -      -      /1/1/12         -     -    -
n-62-31-7           core   -      -      /1/1/13         -     -    -
n-62-31-8           core   -      -      /0/0/0          -     -    -
n-62-31-8           core   -      -      /0/0/1          -     -    -
n-62-31-8           core   -      -      /0/0/3          -     -    -
n-62-31-8           core   -      -      /0/0/4          -     -    -

So the first core of the job is not necessarily "core 0". Only when one is really asking for a full node does it start with the first core on the first CPU (unless the first core on the CPU die is disabled. Yes... real hardware :-D).

Then the job output... first printing some "debug info", and then a mini mpirun (using openmpi & the HWLOC fix). (It would look much the same if I used Intel MPI.)

But at this LSF core-binding affinity level we are not using the physical on-die core numbers, but the logical core numbers. So on a machine with 24 cores: first core is 0, last core is 23. Now back to our MPI job and mpirun:

LSB_AFFINITY_HOSTFILE:
n-62-31-15 20
n-62-31-15 21
n-62-31-15 22
n-62-31-15 23
n-62-31-16 20
n-62-31-16 21
n-62-31-16 22
n-62-31-16 23
n-62-31-7 20
n-62-31-7 21
n-62-31-7 22
n-62-31-7 23
n-62-31-8 0
n-62-31-8 1
n-62-31-8 2
n-62-31-8 3
LSB_DJOB_HOSTFILE:
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-8
n-62-31-8
n-62-31-8
n-62-31-8
LSB_DJOB_RANKFILE:
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-8
n-62-31-8
n-62-31-8
n-62-31-8
LSB_MCPU_HOSTS:
n-62-31-15 4 n-62-31-16 4 n-62-31-7 4 n-62-31-8 4
export HWLOC_ALLOW=all
processor-core is in column 7.
mpirun --display... --report-bindings hostname -s && sleep 1 && PS_PERSONALITY=sgi ps -F \\$\\$

======================   ALLOCATED NODES   ======================
    n-62-31-15: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: n-62-31-15
    n-62-31-16: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: SLOTS_GIVEN
    aliases: NONE
    n-62-31-7: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: SLOTS_GIVEN
    aliases: NONE
    n-62-31-8: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: SLOTS_GIVEN
    aliases: NONE
=================================================================

======================   ALLOCATED NODES   ======================
    n-62-31-15: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: n-62-31-15
    n-62-31-16: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: 10.66.31.16,10.66.85.16
    n-62-31-7: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: 10.66.31.7,10.66.85.7
    n-62-31-8: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: 10.66.31.8,10.66.85.8
=================================================================

======================   ALLOCATED NODES   ======================
    n-62-31-15: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: n-62-31-15
    n-62-31-16: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: 10.66.31.16,10.66.85.16
    n-62-31-7: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: 10.66.31.7,10.66.85.7
    n-62-31-8: slots=4 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
    aliases: 10.66.31.8,10.66.85.8
=================================================================

=================================   JOB MAP   =================================
Data for JOB prterun-n-62-31-15-1999239@1 offset 0 Total slots allocated 16
Mapper requested: rank_file  Last mapper: rank_file  Mapping policy: BYUSER:NOOVERSUBSCRIBE  Ranking policy: BYUSER
Binding policy: CORE:IF-SUPPORTED  Cpu set: N/A  PPR: N/A  Cpus-per-rank: N/A  Cpu Type: HWT
Num new daemons: 0  New daemon starting vpid INVALID
Num nodes: 4

Data for node: n-62-31-15   State: 3    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:MAPPED:SLOTS_GIVEN
                resolved from n-62-31-15
        Daemon: [prterun-n-62-31-15-1999239@0,0]    Daemon launched: True
            Num slots: 4    Slots in use: 4 Oversubscribed: FALSE
            Num slots allocated: 4  Max slots: 0    Num procs: 4
        Data for proc: [prterun-n-62-31-15-1999239@1,0]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 0
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:20]
        Data for proc: [prterun-n-62-31-15-1999239@1,1]
                Pid: 0  Local rank: 1   Node rank: 1    App rank: 1
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:21]
        Data for proc: [prterun-n-62-31-15-1999239@1,2]
                Pid: 0  Local rank: 2   Node rank: 2    App rank: 2
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:22]
        Data for proc: [prterun-n-62-31-15-1999239@1,3]
                Pid: 0  Local rank: 3   Node rank: 3    App rank: 3
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:23]

Data for node: n-62-31-16   State: 3    Flags: DAEMON_LAUNCHED:MAPPED:SLOTS_GIVEN
                resolved from 10.66.31.16
                resolved from 10.66.85.16
        Daemon: [prterun-n-62-31-15-1999239@0,1]    Daemon launched: True
            Num slots: 4    Slots in use: 4 Oversubscribed: FALSE
            Num slots allocated: 4  Max slots: 0    Num procs: 4
        Data for proc: [prterun-n-62-31-15-1999239@1,4]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 4
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:20]
        Data for proc: [prterun-n-62-31-15-1999239@1,5]
                Pid: 0  Local rank: 1   Node rank: 1    App rank: 5
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:21]
        Data for proc: [prterun-n-62-31-15-1999239@1,6]
                Pid: 0  Local rank: 2   Node rank: 2    App rank: 6
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:22]
        Data for proc: [prterun-n-62-31-15-1999239@1,7]
                Pid: 0  Local rank: 3   Node rank: 3    App rank: 7
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:23]

Data for node: n-62-31-7    State: 3    Flags: DAEMON_LAUNCHED:MAPPED:SLOTS_GIVEN
                resolved from 10.66.31.7
                resolved from 10.66.85.7
        Daemon: [prterun-n-62-31-15-1999239@0,2]    Daemon launched: True
            Num slots: 4    Slots in use: 4 Oversubscribed: FALSE
            Num slots allocated: 4  Max slots: 0    Num procs: 4
        Data for proc: [prterun-n-62-31-15-1999239@1,8]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 8
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:20]
        Data for proc: [prterun-n-62-31-15-1999239@1,9]
                Pid: 0  Local rank: 1   Node rank: 1    App rank: 9
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:21]
        Data for proc: [prterun-n-62-31-15-1999239@1,10]
                Pid: 0  Local rank: 2   Node rank: 2    App rank: 10
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:22]
        Data for proc: [prterun-n-62-31-15-1999239@1,11]
                Pid: 0  Local rank: 3   Node rank: 3    App rank: 11
                State: INITIALIZED  App_context: 0
            Binding: package[1][hwt:23]

Data for node: n-62-31-8    State: 3    Flags: DAEMON_LAUNCHED:MAPPED:SLOTS_GIVEN
                resolved from 10.66.31.8
                resolved from 10.66.85.8
        Daemon: [prterun-n-62-31-15-1999239@0,3]    Daemon launched: True
            Num slots: 4    Slots in use: 4 Oversubscribed: FALSE
            Num slots allocated: 4  Max slots: 0    Num procs: 4
        Data for proc: [prterun-n-62-31-15-1999239@1,12]
                Pid: 0  Local rank: 0   Node rank: 0    App rank: 12
                State: INITIALIZED  App_context: 0
            Binding: package[0][hwt:0]
        Data for proc: [prterun-n-62-31-15-1999239@1,13]
                Pid: 0  Local rank: 1   Node rank: 1    App rank: 13
                State: INITIALIZED  App_context: 0
            Binding: package[0][hwt:1]
        Data for proc: [prterun-n-62-31-15-1999239@1,14]
                Pid: 0  Local rank: 2   Node rank: 2    App rank: 14
                State: INITIALIZED  App_context: 0
            Binding: package[0][hwt:2]
        Data for proc: [prterun-n-62-31-15-1999239@1,15]
                Pid: 0  Local rank: 3   Node rank: 3    App rank: 15
                State: INITIALIZED  App_context: 0
            Binding: package[0][hwt:3]

=============================================================
[1,1]<stdout>: sebo     1999245 1999239  3  4179  3432  21 18:59 ?        R      0:00 /bin/ps -
[1,3]<stdout>: sebo     1999247 1999239  3  4179  3444  23 18:59 ?        R      0:00 /bin/ps -
[1,0]<stdout>: sebo     1999244 1999239  3  4179  3444  20 18:59 ?        R      0:00 /bin/ps -
[1,2]<stdout>: sebo     1999246 1999239  3  4179  3428  22 18:59 ?        R      0:00 /bin/ps -
[1,10]<stdout>: sebo     2875545 2875526  3  4179  3448  22 18:59 ?        R      0:00 /bin/ps -
[1,8]<stdout>: sebo     2875544 2875526  3  4179  3444  20 18:59 ?        R      0:00 /bin/ps -
[1,9]<stdout>: sebo     2875546 2875526  3  4179  3428  21 18:59 ?        R      0:00 /bin/ps -
[1,6]<stdout>: sebo     2672111 2672092  3  4179  3436  22 18:59 ?        R      0:00 /bin/ps -
[1,4]<stdout>: sebo     2672110 2672092  3  4179  3440  20 18:59 ?        R      0:00 /bin/ps -
[1,7]<stdout>: sebo     2672112 2672092  3  4179  3432  23 18:59 ?        R      0:00 /bin/ps -
[1,11]<stdout>: sebo     2875547 2875526  3  4179  3520  23 18:59 ?        R      0:00 /bin/ps -
[1,5]<stdout>: sebo     2672113 2672092  3  4179  3432  21 18:59 ?        R      0:00 /bin/ps -
[1,14]<stdout>: sebo     2581432 2581413  3  4179  3400   2 18:59 ?        R      0:00 /bin/ps -
[1,13]<stdout>: sebo     2581434 2581413  3  4179  3388   1 18:59 ?        R      0:00 /bin/ps -
[1,15]<stdout>: sebo     2581433 2581413  3  4179  3364   3 18:59 ?        R      0:00 /bin/ps -
[1,12]<stdout>: sebo     2581431 2581413  3  4179  3364   0 18:59 ?        R      0:00 /bin/ps -

So if LSF is reporting in the affinity hostfile

n-62-31-15 20

that just means that the 20th core on this machine is reserved as the "first logical core" for this LSF job. LSF then creates a cpuset, and LSF expects MPI to respect it.
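For what "respecting this" amounts to in practice, here is a minimal hwloc sketch (assuming it runs inside the LSF job, after LSF has applied its cgroup/affinity restriction) that reads the cpuset the current process is confined to:

#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    hwloc_bitmap_t set = hwloc_bitmap_alloc();
    char *str = NULL;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* the binding LSF (or any resource manager) already imposed on us */
    if (0 == hwloc_get_cpubind(topo, set, HWLOC_CPUBIND_PROCESS)) {
        hwloc_bitmap_asprintf(&str, set);
        printf("current process cpuset: %s\n", str);
        free(str);
    }

    hwloc_bitmap_free(set);
    hwloc_topology_destroy(topo);
    return 0;
}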

So....that's as far as I understand this.

And as "WTF-item": Dell is numbering everything a bit "different", so here is a Dell node:


  $ bhosts -aff  n-62-12-75
Host[503.3G] n-62-12-75
    Socket0
        NUMA[0: 0M / 251.3G]
            core0(0)
            core8(2)
            core1(4)
            core9(6)
            core2(8)
            core10(10)
            core3(12)
            core11(14)
            core4(16)
            core12(18)
            core5(20)
            core13(22)
            core6(24)
            core14(26)
            core7(28)
            core15(30)
    Socket1
        NUMA[1: 0M / 251.9G]
            core0(1)
            core8(3)
            core1(5)
            core9(7)
            core2(9)
            core10(11)
            core3(13)
            core11(15)
            core4(17)
            core12(19)
            core5(21)
            core13(23)
            core6(25)
            core14(27)
            core7(29)
            core15(31)

Here is the matching NUMA output:

#numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 257401 MB
node 0 free: 121271 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 258043 MB
node 1 free: 301 MB
node distances:
node   0   1
  0:  10  20
  1:  20  10

And with openmpi-5.0.3 everything was still working fine, because it respected the cores which LSF was "suggesting" to use.

rhc54 commented 2 weeks ago

The only difference I can see between OMPI v5.0.3 and head of PRRTE master is that the OMPI code still used some "overlay" code instead of just directly calling HWLOC functions (e.g., hwloc_get_obj_by_type). However, I don't see that reflected in the rank_file code - it just calls HWLOC functions.

So I'm not sure why the older version was working and the newer one doesn't. Not that much has changed in the affected areas. Would have to dig into the code and follow the proc placement procedure at an atomistic level to try and see a difference, assuming it must exist (but likely is very subtle). As stated above, the basic problem seems to be that we get an unavailable object (i.e., NULL) returned when using the HWLOC function to obtain the physical core object specified in the affinity file, and there is not much I can do about it from there.

sb22bs commented 2 weeks ago

So... this is the "troubling patch":

diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index 4a32a7fa1a..055b6dae4b 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -1865,12 +1865,18 @@ int prte_hwloc_base_topology_set_flags(hwloc_topology_t topology, unsigned long
 {
     if (io) {
 #if HWLOC_API_VERSION < 0x00020000
+        flags |= HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM;
         flags |= HWLOC_TOPOLOGY_FLAG_IO_DEVICES;
 #else
         int ret = hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
         if (0 != ret) {
             return ret;
         }
+#    if HWLOC_API_VERSION < 0x00020100
+        flags |= HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM;
+#    else
+        flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED;
+#    endif
 #endif
     }
     // Blacklist the "gl" component due to potential conflicts.

We are using a "fresh" hwloc here, so the third case (the INCLUDE_DISALLOWED branch) applies to us:

Here is the example just using hwloc:

LSB_AFFINITY_HOSTFILE:
n-62-31-15 20
n-62-31-15 21
n-62-31-15 22
n-62-31-15 23
n-62-31-16 20
n-62-31-16 21
n-62-31-16 22
n-62-31-16 23
n-62-31-7 20
n-62-31-7 21
n-62-31-7 22
n-62-31-7 23
n-62-31-17 0
n-62-31-17 1
n-62-31-17 2
n-62-31-17 3

Loaded dependency [mpi/5.0.3-gcc-14.1.0-binutils-2.42]: gcc/14.1.0-binutils-2.42
Loaded module: mpi/5.0.3-gcc-14.1.0-binutils-2.42

# the good usecase with and older mpi which works
# and just gives the "correct" cores.
Loading mpi/5.0.3-gcc-14.1.0-binutils-2.42
  Loading requirement: gcc/14.1.0-binutils-2.42
hwloc-info 2.10.0
hwloc-ls --no-io --filter core:important
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#0 (19MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#20)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#21)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#22)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#23)

Loaded dependency [mpi/5.0.5-gcc-14.2.0-binutils-2.43-sebotest1]: gcc/14.2.0-binutils-2.43
Loaded module: mpi/5.0.5-gcc-14.2.0-binutils-2.43-sebotest1

# and this newer mpi with include-disallowed gives cores which are not in LSF's cpuset:

Loading mpi/5.0.5-gcc-14.2.0-binutils-2.43-sebotest1
  Loading requirement: gcc/14.2.0-binutils-2.43
hwloc-info 2.11.1
hwloc-ls --no-io --disallowed --filter core:important
Machine (377GB total)
  Package L#0
    NUMANode L#0 (P#0 188GB)
    L3 L#0 (19MB)
      L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
      L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
      L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
      L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
      L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
      L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
      L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
      L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
      L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
      L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
      L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
      L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
  Package L#1
    NUMANode L#1 (P#1 189GB)
    L3 L#1 (19MB)
      L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
      L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
      L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
      L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
      L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
      L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
      L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
      L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
      L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
      L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
      L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
      L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
rhc54 commented 2 weeks ago

I don't think that's quite correct - here is what that section of code looks like in PRRTE master branch:

int prte_hwloc_base_topology_set_flags(hwloc_topology_t topology, unsigned long flags, bool io)
{
    if (io) {
#if HWLOC_API_VERSION < 0x00020000
        flags |= HWLOC_TOPOLOGY_FLAG_IO_DEVICES;
#else
        int ret = hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
        if (0 != ret) {
            return ret;
        }
#endif
    }
    // Blacklist the "gl" component due to potential conflicts.
    // See "https://github.com/open-mpi/ompi/issues/10025" for
    // an explanation
#ifdef HWLOC_VERSION_MAJOR
#if HWLOC_VERSION_MAJOR > 2
    hwloc_topology_set_components(topology, HWLOC_TOPOLOGY_COMPONENTS_FLAG_BLACKLIST, "gl");
#elif HWLOC_VERSION_MAJOR == 2 && HWLOC_VERSION_MINOR >= 1
    hwloc_topology_set_components(topology, HWLOC_TOPOLOGY_COMPONENTS_FLAG_BLACKLIST, "gl");
#endif
#endif
    return hwloc_topology_set_flags(topology, flags);
}

We removed the lines you cite some time ago as we switched to using HWLOC's "allowed cpuset" function. However, you have stated that PRRTE master continues to fail for LSF - which means that there is some other cause.

rhc54 commented 1 week ago

Been some discussion over here about what to do with this given that nobody over here has access to an appropriate system. Best I can determine, the rank_file code is working correctly when given a correct rankfile. The issue therefore seems to lie in the conversion of the LSF affinity file to a PRRTE rankfile. In some situations, when we ask for the HWLOC object corresponding to the LSF physical core ID, we get a NULL return indicating that the specified object is not available to us.

It isn't clear if the problem lies in LSF (either in LSF itself or in your local setup?), in HWLOC (perhaps not correctly parsing the bound topology within the allocation?), or in the way we are using HWLOC (maybe in the flags we pass when reading the topology?). We had a couple of suggestions:

I don't know if @bgoglin has any other thoughts or suggestions. I'm afraid I'm somewhat stuck at this point.

sb22bs commented 1 week ago

Here are some hwloc-xml-files:

good.and.bad.tar.gz

for the "working" variant the cpusets are all the same on all nodes, and the "crashing" variant has (at least?) two different kind of cpusets.

rhc54 commented 1 week ago

Ah - well that at least partially explains the problem. The code explicitly states that it assumes a homogeneous system. Been that way from the very beginning. I'll try to find some time to look at it, but make no promises.

rhc54 commented 1 week ago

Can you give this a try? Apply to PRRTE master branch:

diff --git a/src/mca/rmaps/rank_file/rmaps_rank_file.c b/src/mca/rmaps/rank_file/rmaps_rank_file.c
index d1a2401a41..acaf73aaa6 100644
--- a/src/mca/rmaps/rank_file/rmaps_rank_file.c
+++ b/src/mca/rmaps/rank_file/rmaps_rank_file.c
@@ -756,6 +756,28 @@ static int prte_rmaps_rf_process_lsf_affinity_hostfile(prte_job_t *jdata,
     return PRTE_SUCCESS;
 }

+static bool quickmatch(prte_node_t *nd, char *name)
+{
+    int n;
+
+    if (0 == strcmp(nd->name, name)) {
+        return true;
+    }
+    if (0 == strcmp(nd->name, prte_process_info.nodename) &&
+        (0 == strcmp(name, "localhost") ||
+         0 == strcmp(name, "127.0.0.1"))) {
+        return true;
+    }
+    if (NULL != nd->aliases) {
+        for (n=0; NULL != nd->aliases[n]; n++) {
+            if (0 == strcmp(nd->aliases[n], name)) {
+                return true;
+            }
+        }
+    }
+    return false;
+}
+
 static int prte_rmaps_rf_lsf_convert_affinity_to_rankfile(char *affinity_file, char **aff_rankfile)
 {
     FILE *fp;
@@ -765,9 +787,9 @@ static int prte_rmaps_rf_lsf_convert_affinity_to_rankfile(char *affinity_file, c
     char *tmp_str = NULL;
     size_t len;
     char **cpus;
-    int i;
+    int i, j;
     hwloc_obj_t obj;
-    prte_topology_t *my_topo = NULL;
+    prte_node_t *node, *nptr;

     if( NULL != *aff_rankfile) {
         free(*aff_rankfile);
@@ -835,11 +857,33 @@ static int prte_rmaps_rf_lsf_convert_affinity_to_rankfile(char *affinity_file, c
         // Convert the Physical CPU set from LSF to a Hwloc logical CPU set
         pmix_output_verbose(20, prte_rmaps_base_framework.framework_output,
                             "mca:rmaps:rf: (lsf) Convert Physical CPUSET from <%s>", sep);
-        my_topo = (prte_topology_t *) pmix_pointer_array_get_item(prte_node_topologies, 0);
+
+        // find the named host
+        nptr = NULL;
+        for (j = 0; j < prte_node_pool->size; j++) {
+            node = (prte_node_t *) pmix_pointer_array_get_item(prte_node_pool, j);
+            if (NULL == node) {
+                continue;
+            }
+            if (quickmatch(node, hstname)) {
+                nptr = node;
+                break;
+            }
+        }
+        if (NULL == nptr) {
+            /* wasn't found - that is an error */
+            pmix_show_help("help-rmaps_rank_file.txt",
+                           "resource-not-found", true,
+                           hstname);
+            fclose(fp);
+            close(fp_rank);
+            return PRTE_ERROR;
+        }
+
         cpus = PMIX_ARGV_SPLIT_COMPAT(sep, ',');
         for(i = 0; NULL != cpus[i]; ++i) {
-            // assume HNP has the same topology as other nodes
-            obj = hwloc_get_pu_obj_by_os_index(my_topo->topo, strtol(cpus[i], NULL, 10)) ;
+            // get the specified object
+            obj = hwloc_get_pu_obj_by_os_index(nptr->topology->topo, strtol(cpus[i], NULL, 10)) ;
             if (NULL == obj) {
                 PMIX_ARGV_FREE_COMPAT(cpus);
                 fclose(fp);
sb22bs commented 1 week ago

openpmix 3ecdbf32c5dc77beb066c8683df49648cb920804
prrte bc3c11e76a4928062ada6c423906ab5ad3b758e9
openmpi-5.0.5

It works... thanks a lot... but now I'm crashing in pmix. After a while of running the petsc streams benchmark, the output is:

$ cat scaling.log
1  15043.8528   Rate (MB/s)
2  27858.5442   Rate (MB/s) 1.85182 
3  42371.2441   Rate (MB/s) 2.81651 
4  55388.2854   Rate (MB/s) 3.68178 
5  67671.5669   Rate (MB/s) 4.49827 
6  71739.3803   Rate (MB/s) 4.76867 
7  77762.7681   Rate (MB/s) 5.16906 
8  90678.2054   Rate (MB/s) 6.02757 

<crash>

Here is one of the stack-traces:

(gdb) info stack
#0  0x00007f011a88b94c in __pthread_kill_implementation () from /lib64/libc.so.6
#1  0x00007f011a83e646 in raise () from /lib64/libc.so.6
#2  0x00007f011a828885 in abort () from /lib64/libc.so.6
#3  0x00007f011a82871b in __assert_fail_base.cold () from /lib64/libc.so.6
#4  0x00007f011a837386 in __assert_fail () from /lib64/libc.so.6
#5  0x00007f011b02098d in pmix_gds_base_store_modex (buff=buff@entry=0x7f011a5eeab0, cb_fn=cb_fn@entry=0x7f011b05b740 <_hash_store_modex>, cbdata=cbdata@entry=0x7f0114030c70) at base/gds_base_fns.c:149
#6  0x00007f011b05b71c in hash_store_modex (buf=0x7f011a5eeab0, cbdata=0x7f0114030c70) at gds_hash.c:1328
#7  0x00007f011ae888fa in _mdxcbfunc (sd=-1, args=args@entry=4, cbdata=0x3acd2440) at server/pmix_server.c:3679
#8  0x00007f011ad61943 in event_process_active_single_queue (base=base@entry=0x3acbb6d0, activeq=activeq@entry=0x3acbbb20, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0) at event.c:1691
#9  0x00007f011ad61e8f in event_process_active (base=base@entry=0x3acbb6d0) at event.c:1783
#10 0x00007f011ad628d7 in event_base_loop (base=0x3acbb6d0, flags=flags@entry=1) at event.c:2006
#11 0x00007f011aee8d8c in progress_engine (obj=0x3ad57d18) at runtime/pmix_progress_threads.c:110
#12 0x00007f011a889c02 in start_thread () from /lib64/libc.so.6
#13 0x00007f011a90ec40 in clone3 () from /lib64/libc.so.6
sb22bs commented 1 week ago

Sorry, I forgot to add the last words:

[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive registered command from [prterun-n-62-12-60-83080@0,3]
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for job prterun-n-62-12-60-83080@1
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for vpid 12
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for vpid 13
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for vpid 14
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive done processing commands
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:launch prterun-n-62-12-60-83080@1 registered
[n-62-12-60:83080] PMIX ERROR: PMIX_ERR_UNPACK_READ_PAST_END_OF_BUFFER in file base/gds_base_fns.c at line 148
prterun: base/gds_base_fns.c:149: pmix_gds_base_store_modex: Assertion `PMIX_OBJ_MAGIC_ID == ((pmix_object_t *) (&bkt))->obj_magic_id' failed.
rhc54 commented 1 week ago

No idea what that app does, but it's odd that it would crash after running for a while. The referenced operation takes place during MPI_Init. I would have to think a bit about it, but any further info about the run that failed (like what is different relative to the runs that worked) would help.

sb22bs commented 1 week ago

So... I have also recompiled PETSc now. It's the same problem, but I have now also discovered the "bad prefix" messages.

[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got local launch complete for vpid 11 state RUNNING
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:launch wiring up iof for job prterun-n-62-12-60-93112@1
[n-62-12-60:93120] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-60:93117] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-60:93119] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-60:93118] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-61:80262] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-61:80260] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-61:80261] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-61:80263] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-63:64744] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-63:64742] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-63:64743] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-63:64745] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-62:88964] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-62:88962] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-62:88965] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-62:88963] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
    pmix_mca_
or
    libpmix_mca_
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive processing msg
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive registered command from [prterun-n-62-12-60-93112@0,1]
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for job prterun-n-62-12-60-93112@1
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 4
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 5
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 6
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 7
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive processing msg
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive registered command from [prterun-n-62-12-60-93112@0,3]
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for job prterun-n-62-12-60-93112@1
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 12
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 13
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 14
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 15
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive processing msg
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive registered command from [prterun-n-62-12-60-93112@0,2]
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for job prterun-n-62-12-60-93112@1
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 8
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 9
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 10
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 11
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:launch prterun-n-62-12-60-93112@1 registered
[n-62-12-60:93112] PMIX ERROR: PMIX_ERR_UNPACK_READ_PAST_END_OF_BUFFER in file base/gds_base_fns.c at line 148
prterun: base/gds_base_fns.c:149: pmix_gds_base_store_modex: Assertion `PMIX_OBJ_MAGIC_ID == ((pmix_object_t *) (&bkt))->obj_magic_id' failed.
------------------------------------------------
See graph in the file src/benchmarks/streams/MPIscaling.png

I think I have to redo this in a clean way again, just to make sure I haven't created some non-debuggable mess.

sb22bs commented 1 week ago

I have now patched prrte-3.0.6 with your patch, using pmix-5.0.3 and openmpi-5.0.5, and MPI across "different nodes" works without crashing. So the above PMIX unpack error seems to be unrelated to this (LSF-related) problem. So maybe a Slurm user can give this one a try?

rhc54 commented 1 week ago

Not sure what it would have to do with Slurm, but I agree it is unlikely to relate to LSF either. Will follow up with the change to PRRTE. Thanks for the assist!