Closed zerothi closed 1 week ago
Not terribly surprising - LSF support was transferred to IBM, which subsequently left the OMPI/PRRTE projects. So nobody has been supporting LSF stuff, and being honest, nobody has access to an LSF system. So I'm not sure I see a clear path forward here - might be that LSF support is coming to an end, or is at least somewhat modified/limited (perhaps need to rely solely on rankfile and use of revised cmd line options - may not be able to utilize LSF "integration").
Rankfile mapping support is a separate issue that has nothing to do with LSF, so that can perhaps be investigated if you can provide your topology file in an HWLOC XML format. Will likely take awhile to get a fix as support time is quite limited.
As for LSF support... damn...
As for rankfile mapping: I got it through
hwloc-gather-topology test
I have never done that before, so let me know if that is correct.
(Couldn't upload XML files, had to be compressed.) test.xml.gz
Best I can suggest is that you contact IBM through your LSF contract support and point out that if they want OMPI to function on LSF going forward, they probably need to put a little effort into supporting it. 🤷♂️
XML looks fine - thanks! Will update as things progress.
prrte @ 42169d1cebf75318ced0306172d3a452ece13352 is the last good one, prrte @ f297a9e2eb96c2db9d7756853f56315ea5a127cd seems to break it (at least in our setup).
workaround: export HWLOC_ALLOW=all :-)
Ouch - I would definitely advise against doing so. It might work for a particular application, but almost certainly will cause breakage in general.
But the documentation is still suggesting that Open MPI can be built with LSF support.
Should probably be updated to indicate that it is no longer being tested, and so may or may not work. However, these other folks apparently were able to build on LSF, so I suspect this is more likely to be a local problem.
prrte @ 42169d1cebf75318ced0306172d3a452ece13352
which openmpi release?
Thank you!
They seem to indicate that v5.0.3 is working, but all the v5.0.x appear to at least build for them.
Yes, everything builds. So that isn't the issue. It is rather that it isn't working as intended :cry:
I tried all 5.0.x versions, but they won't pass configure. I managed to build it with 4.0.3:
./configure --with-lsf=/usr/local/lsf/10.1 --with-lsf-libdir="${LSF_LIBDIR}" --with-cuda=/usr/local/cuda --enable-mca-dso=btl-smcuda,rcache-rgpusm,rcache-gpusm,accelerator-cuda LIBS="-levent" LDFLAGS="-L/usr/lib/x86_64-linux-gnu" --prefix=/software/openmpi-4.0.3-cuda
I use the tarball to build, could that be the problem?
You could check whether there is something important in our configure step that you are missing (see the initial message).
I don't know what the issue could be. But I think this thread shouldn't be cluttered with it - rather open a new issue, IMHO. This issue is deeper (not a build issue).
I did open a ticket
workaround: export HWLOC_ALLOW=all :-)
@bgoglin This implies that the envar is somehow overriding the flags we pass into the topology discover API - is that true? Just wondering how we ensure that hwloc is accurately behaving as we request.
This envvar disables the gathering of things like cgroups (I'll check if I need to clarify that). It doesn't override something configured in the API (like changing topology flags to include/exclude cgroup-disabled resources), but rather considers that everything is enabled in cgroups (hence the INCLUDE_DISALLOWED flag becomes a noop).
@bgoglin Hmmm...we removed this code from PRRTE:
flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED;
because we only want the topology to contain the CPUs the user is allowed to use (note: all CPUs will still be in the complete_cpuset field if we need them - we use the return from hwloc_topology_get_allowed_cpuset). If the topology includes all CPUs (which is what happens when we include the above line of code), then we wind up thinking we can use them, which messes up the mapping/binding algorithm. So what I need is a way of not allowing the user to override that requirement by setting this envar. Might help a particular user in a specific situation, but more generally causes problems.
I'll work out the issue for LSF as a separate problem - we don't see problems elsewhere, so it has something to do with what LSF is doing. My question for you is: how do I ensure the cpuset returned by get_allowed_cpuset() only contains allowed CPUs, which is what PRRTE needs?
Just ignore this corner case. @sb22bs said using this envvar is a workaround. It was designed for strange buggy cases, e.g. when cgroups are misconfigured. I can try to better document that this envvar is a bad idea unless you really know what you are doing. Just consider that get_allowed_cpuset() is always correct.
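As an aside, the behaviour described here is easy to check directly; a minimal sketch (not PRRTE code, just plain hwloc) that prints the allowed cpuset next to the topology cpuset - running it with and without HWLOC_ALLOW=all shows the difference:

```c
/* Minimal sketch (not PRRTE code): print what hwloc reports as allowed.
 * Build with: gcc check_allowed.c -o check_allowed -lhwloc */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    hwloc_topology_t topo;
    char *allowed = NULL, *all = NULL;

    hwloc_topology_init(&topo);
    /* default flags: disallowed (e.g. cgroup-excluded) PUs are not included */
    hwloc_topology_load(topo);

    /* CPUs this process may use; reflects cgroups unless HWLOC_ALLOW=all is
     * set, in which case everything appears allowed */
    hwloc_bitmap_asprintf(&allowed, hwloc_topology_get_allowed_cpuset(topo));
    /* all CPUs present in this topology */
    hwloc_bitmap_asprintf(&all, hwloc_topology_get_topology_cpuset(topo));

    printf("allowed cpuset : %s\n", allowed);
    printf("topology cpuset: %s\n", all);

    free(allowed);
    free(all);
    hwloc_topology_destroy(topo);
    return 0;
}
```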
Not sure I can do much with this one - no access to an LSF machine, so all I can do is poke around a bit. Using the head of PRRTE's master branch along with your topology, I feed it the following LSB hostfile per one of your comments above:
n-62-12-14 6,70
n-62-12-14 7,71
n-62-12-15 6,70
n-62-12-15 7,71
and I get the following corresponding rank file generated:
rank 0=n-62-12-14 slot=12,13
rank 1=n-62-12-14 slot=14,15
rank 2=n-62-12-15 slot=12,13
rank 3=n-62-12-15 slot=14,15
with everything expressed in HWT. I have no idea if that is what you expected/wanted? It certainly didn't segfault, but I confess I'm getting confused with all the various scenarios being covered here, so I'm not sure what you did to get a segfault and/or if that is something that is still seen.
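For anyone trying to follow the conversion being described here, a rough sketch (my own illustration, not the PRRTE source) of turning "host cpu,cpu" affinity-hostfile lines into "rank N=host slot=..." rankfile lines, with the CPU numbers translated from OS (physical) indices to hwloc logical indices on the local topology:

```c
/* Rough illustration (not the PRRTE source) of turning LSB affinity hostfile
 * lines ("host cpu,cpu") into rankfile lines ("rank N=host slot=..."),
 * converting OS (physical) indices to hwloc logical indices.
 * Uses the local topology for every host - the same simplifying assumption
 * discussed later in this thread. */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void)
{
    /* example lines taken from the comment above */
    const char *lines[] = { "n-62-12-14 6,70", "n-62-12-14 7,71", NULL };
    hwloc_topology_t topo;
    int rank = 0;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    for (int n = 0; lines[n] != NULL; n++) {
        char buf[256], *host, *cpus, *tok, *save;
        snprintf(buf, sizeof(buf), "%s", lines[n]);
        host = strtok_r(buf, " ", &save);
        cpus = strtok_r(NULL, " ", &save);
        if (NULL == host || NULL == cpus) {
            continue;
        }
        printf("rank %d=%s slot=", rank++, host);
        int first = 1;
        for (tok = strtok_r(cpus, ",", &save); NULL != tok;
             tok = strtok_r(NULL, ",", &save)) {
            /* look up the PU by its OS (physical) index */
            hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, (unsigned) atoi(tok));
            if (NULL == pu) {
                /* this is exactly the NULL return that segfaulted here */
                printf("%s?", first ? "" : ",");
            } else {
                printf("%s%u", first ? "" : ",", pu->logical_index);
            }
            first = 0;
        }
        printf("\n");
    }
    hwloc_topology_destroy(topo);
    return 0;
}
```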
Seems to look OK - I'll at least report back if things get solved ;) But for sure, I can understand these things are hard to test without a backing LSF machine... :(
Give PRRTE master branch a try - it might well be that the problem has been fixed, but that it didn't make its way into an OMPI release yet.
Hi,
Somehow it crashed... but I couldn't get a stack trace.
Then I added --enable-debug, and the compiler (gcc-14.2) throws an error:
Making all in mca/plm
make[2]: Entering directory '/tmp/xyz/prrte/src/mca/plm'
CC base/plm_base_frame.lo
CC base/plm_base_select.lo
CC base/plm_base_receive.lo
CC base/plm_base_launch_support.lo
CC base/plm_base_jobid.lo
CC base/plm_base_prted_cmds.lo
base/plm_base_launch_support.c: In function ‘prte_plm_base_daemon_callback’:
base/plm_base_launch_support.c:1656:42: error: ‘t’ may be used uninitialized [-Werror=maybe-uninitialized]
1656 | dptr->node->topology = t;
| ~~~~~~~~~~~~~~~~~~~~~^~~
base/plm_base_launch_support.c:1311:22: note: ‘t’ was declared here
1311 | prte_topology_t *t, *mytopo;
| ^
Not sure the compiler is right on that one, but I updated the code just in case - pull the repo and it should be fixed.
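For reference, the class of warning gcc raised above is usually handled by giving the pointer a defined value on every path and checking the not-found case; a generic sketch (not the actual PRRTE change):

```c
/* Generic illustration of the defensive fix for -Wmaybe-uninitialized:
 * give the pointer a defined value on every path and handle "not found".
 * Not the actual PRRTE change. */
#include <stddef.h>

struct topology { int id; };

struct topology *find_topology(struct topology **list, int n, int wanted)
{
    struct topology *t = NULL;   /* defined on every path */

    for (int i = 0; i < n; i++) {
        if (NULL != list[i] && list[i]->id == wanted) {
            t = list[i];
            break;
        }
    }
    return t;   /* caller must handle the NULL (not-found) case */
}
```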
Versions - just for reference:
openpmix 92d3473450d3cf9019ef5951e1cc3a1322feb804
prrte d5e580a0fe2c4cf893da0cc820fe6d188c2c6069
openmpi-5.0.5
$ gdb /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte/bin/prterun core.3230045
GNU gdb (GDB) 14.2
Copyright (C) 2023 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.
Type "show copying" and "show warranty" for details.
This GDB was configured as "x86_64-pc-linux-gnu".
Type "show configuration" for configuration details.
For bug reporting instructions, please see:
<https://www.gnu.org/software/gdb/bugs/>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
For help, type "help".
Type "apropos word" to search for commands related to "word"...
Reading symbols from /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte/bin/prterun...
warning: core file may not match specified executable file.
[New LWP 3230045]
[New LWP 3230047]
[New LWP 3230048]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `/appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte/bin/prter'.
Program terminated with signal SIGSEGV, Segmentation fault.
#0 0x00007f5677b32e1e in prte_rmaps_rf_lsf_convert_affinity_to_rankfile (affinity_file=affinity_file@entry=0x7fff3b27f314 "/zhome/31/b/80425/.lsbatch/1729938715.22929140.hostAffinityFile",
aff_rankfile=aff_rankfile@entry=0x7fff3b27a0f8) at rmaps_rank_file.c:847
847 sprintf(cpus[i], "%d", obj->logical_index);
[Current thread is 1 (Thread 0x7f5677442b80 (LWP 3230045))]
(gdb) info stack
#0 0x00007f5677b32e1e in prte_rmaps_rf_lsf_convert_affinity_to_rankfile (affinity_file=affinity_file@entry=0x7fff3b27f314 "/zhome/31/b/80425/.lsbatch/1729938715.22929140.hostAffinityFile",
aff_rankfile=aff_rankfile@entry=0x7fff3b27a0f8) at rmaps_rank_file.c:847
#1 0x00007f5677b328b3 in prte_rmaps_rf_process_lsf_affinity_hostfile (jdata=jdata@entry=0x1d7ee7a0, options=options@entry=0x7fff3b27a7e0,
affinity_file=affinity_file@entry=0x7fff3b27f314 "/zhome/31/b/80425/.lsbatch/1729938715.22929140.hostAffinityFile") at rmaps_rank_file.c:738
#2 0x00007f5677b2f392 in prte_rmaps_rf_map (jdata=0x1d7ee7a0, options=0x7fff3b27a7e0) at rmaps_rank_file.c:137
#3 0x00007f5677b1928c in prte_rmaps_base_map_job (fd=-1, args=<optimized out>, cbdata=0x1d7f1bd0) at base/rmaps_base_map_job.c:837
#4 0x00007f5677681350 in event_process_active_single_queue (base=base@entry=0x1d578940, activeq=0x1d578d90, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0)
at event.c:1691
#5 0x00007f56776817d9 in event_process_active (base=base@entry=0x1d578940) at event.c:1783
#6 0x00007f5677681a72 in event_base_loop (base=0x1d578940, flags=flags@entry=1) at event.c:2006
#7 0x0000000000409aca in main (argc=4, argv=0x7fff3b27b658) at prte.c:1185
(gdb)
This is the content of $LSB_AFFINITY_HOSTFILE
hpc-node(XeonGold6126):n-62-30-30(sebo) $ cat ...../1729938715.22929140.hostAffinityFile
n-62-31-13 16
n-62-31-13 17
n-62-31-13 18
n-62-31-13 19
n-62-31-15 8
n-62-31-15 9
n-62-31-15 10
n-62-31-15 11
n-62-31-8 0
n-62-31-8 1
n-62-31-8 2
n-62-31-8 3
n-62-31-17 0
n-62-31-17 1
n-62-31-17 2
n-62-31-17 3
And this is the binding LSF is reporting via bjobs -l -aff:
AFFINITY:
CPU BINDING MEMORY BINDING
------------------------ --------------------
HOST TYPE LEVEL EXCL IDS POL NUMA SIZE
n-62-31-13 core - - /1/1/6 - - -
n-62-31-13 core - - /1/1/8 - - -
n-62-31-13 core - - /1/1/9 - - -
n-62-31-13 core - - /1/1/10 - - -
n-62-31-15 core - - /0/0/10 - - -
n-62-31-15 core - - /0/0/11 - - -
n-62-31-15 core - - /0/0/12 - - -
n-62-31-15 core - - /0/0/14 - - -
n-62-31-8 core - - /0/0/0 - - -
n-62-31-8 core - - /0/0/1 - - -
n-62-31-8 core - - /0/0/3 - - -
n-62-31-8 core - - /0/0/4 - - -
n-62-31-17 core - - /0/0/0 - - -
n-62-31-17 core - - /0/0/2 - - -
n-62-31-17 core - - /0/0/3 - - -
n-62-31-17 core - - /0/0/4 - - -
The case where it doesn't crash is when the cores start from "0".
$LSB_AFFINITY_HOSTFILE:
n-62-12-71 0
n-62-12-71 1
n-62-12-71 2
n-62-12-71 3
n-62-12-72 0
n-62-12-72 1
n-62-12-72 2
n-62-12-72 3
n-62-12-73 0
n-62-12-73 1
n-62-12-73 2
n-62-12-73 3
n-62-12-74 0
n-62-12-74 1
n-62-12-74 2
n-62-12-74 3
And this is what LSF is "saying" regarding binding:
AFFINITY:
CPU BINDING MEMORY BINDING
------------------------ --------------------
HOST TYPE LEVEL EXCL IDS POL NUMA SIZE
n-62-12-71 core - - /0/0/0 - - -
n-62-12-71 core - - /0/0/1 - - -
n-62-12-71 core - - /0/0/2 - - -
n-62-12-71 core - - /0/0/3 - - -
n-62-12-72 core - - /0/0/0 - - -
n-62-12-72 core - - /0/0/1 - - -
n-62-12-72 core - - /0/0/2 - - -
n-62-12-72 core - - /0/0/3 - - -
n-62-12-73 core - - /0/0/0 - - -
n-62-12-73 core - - /0/0/1 - - -
n-62-12-73 core - - /0/0/2 - - -
n-62-12-73 core - - /0/0/3 - - -
n-62-12-74 core - - /0/0/0 - - -
n-62-12-74 core - - /0/0/1 - - -
n-62-12-74 core - - /0/0/2 - - -
n-62-12-74 core - - /0/0/3 - - -
I think it's just that when prrte is not respecting the LSF binding, it crashes... (unless one sets this nice HWLOC environment variable).
Looks like LSF is counting object indices in some strange way - when we ask for the object of the given number, we get a NULL return indicating that it doesn't exist. I'm guessing that the LSF index is based on including objects that are not available to the user. Sounds like a bug on their side to me. Regardless, there isn't anything we can do with it, so we'll just have to declare LSF unsupported. I can add some protection so we print an error message instead of segfaulting.
Setting that envar will break a number of other things, but so long as you don't encounter them, you might be able to operate. Probably going to be hit/miss, though.
Suppose one thing you could check - we are expecting those affinity file values to be physical indices. Maybe LSF has changed to providing logical indices? Just grasping at straws here, but it would explain why we aren't finding that object. Otherwise, a little hard to understand why we aren't finding it in the topology - the os_index (physical) shouldn't depend on the available processors.
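To make the physical-vs-logical distinction concrete, here is a small sketch (assuming, as above, that the affinity file is supposed to carry OS/physical PU numbers) showing how differently the same number resolves under the two interpretations:

```c
/* Sketch: the same number refers to different PUs depending on whether it
 * is treated as an OS (physical) index or as an hwloc logical index. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;
    unsigned idx = 20;   /* e.g. a value taken from LSB_AFFINITY_HOSTFILE */

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    /* interpretation used by the rankfile converter: OS/physical index */
    hwloc_obj_t by_os = hwloc_get_pu_obj_by_os_index(topo, idx);
    /* alternative interpretation: hwloc logical index */
    hwloc_obj_t by_logical = hwloc_get_obj_by_type(topo, HWLOC_OBJ_PU, idx);

    if (NULL != by_os) {
        printf("OS index %u      -> PU L#%u\n", idx, by_os->logical_index);
    } else {
        printf("OS index %u      -> not present in this topology (NULL)\n", idx);
    }
    if (NULL != by_logical) {
        printf("logical index %u -> PU P#%u\n", idx, by_logical->os_index);
    } else {
        printf("logical index %u -> fewer PUs than that are visible (NULL)\n", idx);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```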
@bgoglin Any thoughts on what could be the problem? Debating about removing all LSF-related code, but hate to do so if this is something somebody with access might be able to track down and fix.
I don't understand what the LSF outputs above mean. Is it possible to launch a parallel job with LSF without mpirun and look at where it binds processes? Something like mpirun -np 20 sleep 1000, but using LSF's own launcher (whatever that is) instead of mpirun.
Just an example: a while ago we used Torque/Moab, and there we also shared nodes between different jobs. So if you have a machine with 32 cores, then 24 cores might be involved in some MPI job spanning multiple machines, and then maybe there is another 4-core job and a single-core job. We are doing the same kind of thing with LSF. I guess one can also do this with SLURM.
Okay...let's make an example.
Asking for 16 processes, distributed across 4 machines with 4 processes on each machine:
#BSUB -n 16
#BSUB -R "span[ptile=4]"
LSF is handing out "physical" cores, and these you can then see with "bjobs -l -aff".
(This is running on older Skylakes, so they are dual-socket with 12 cores per socket.)
The LSF-internal CPU numbering is a bit weird, but they are also using hwloc somewhere in the background. So it seems that it's really the core numbering on the physical CPU die.
Showing this now only for the first machine of this MPI job:
$ bhosts -aff n-62-31-15
Host[377G] n-62-31-15
Socket0
NUMA[0: 0M / 188.1G]
core0(*0)
core2(*1)
core3(*2)
core4(*3)
core5(*4)
core6(*5)
core8(*6)
core9(*7)
core10(*8)
core11(*9)
core12(*10)
core14(*11)
Socket1
NUMA[1: 0M / 188.9G]
core0(*12)
core1(*13)
core2(*14)
core4(*15)
core5(*16)
core6(*17)
core8(*18)
core9(*19)
core10(20)
core11(21)
core13(22)
core14(23)
LSF always starts counting at logical core "0" (the number in parentheses), but the first core on the physical CPU die doesn't have to be 0.
So the LSF core "second socket, last physical core on the die" (on node n-62-31-15) is "1/1/14".
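The same per-package numbering can be listed straight from hwloc; a small sketch (mine, hwloc 2.x assumed) that prints each core's socket and on-die physical number next to its machine-wide logical number, roughly matching the bhosts -aff layout above:

```c
/* Sketch: list cores grouped by package with their physical (on-die) numbers,
 * similar to the bhosts -aff listing above. */
#include <hwloc.h>
#include <stdio.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    hwloc_topology_load(topo);

    int ncores = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_CORE);
    for (int i = 0; i < ncores; i++) {
        hwloc_obj_t core = hwloc_get_obj_by_type(topo, HWLOC_OBJ_CORE, (unsigned) i);
        hwloc_obj_t pkg = hwloc_get_ancestor_obj_by_type(topo, HWLOC_OBJ_PACKAGE, core);
        /* pkg->os_index ~ the "Socket" number, core->os_index ~ the on-die
         * core number, i ~ the machine-wide logical core number */
        printf("Socket%u core%u (*%d)\n",
               (NULL != pkg) ? pkg->os_index : 0u, core->os_index, i);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```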
Just to make it complete - then the numactl --hardware output:
$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9
node 0 size: 63527 MB
node 0 free: 52405 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19
node 1 size: 64457 MB
node 1 free: 50947 MB
node distances:
node 0 1
0: 10 21
1: 21 10
And just take the IDS column below as "socket/socket/core-number" in our case here:
AFFINITY:
CPU BINDING MEMORY BINDING
------------------------ --------------------
HOST TYPE LEVEL EXCL IDS POL NUMA SIZE
n-62-31-15 core - - /1/1/10 - - -
n-62-31-15 core - - /1/1/11 - - -
n-62-31-15 core - - /1/1/13 - - -
n-62-31-15 core - - /1/1/14 - - -
n-62-31-16 core - - /1/1/10 - - -
n-62-31-16 core - - /1/1/12 - - -
n-62-31-16 core - - /1/1/13 - - -
n-62-31-16 core - - /1/1/14 - - -
n-62-31-7 core - - /1/1/9 - - -
n-62-31-7 core - - /1/1/11 - - -
n-62-31-7 core - - /1/1/12 - - -
n-62-31-7 core - - /1/1/13 - - -
n-62-31-8 core - - /0/0/0 - - -
n-62-31-8 core - - /0/0/1 - - -
n-62-31-8 core - - /0/0/3 - - -
n-62-31-8 core - - /0/0/4 - - -
So the first core of the job is not necessarily "core 0"; only when one asks for a full node does it start with the first core on the first CPU (unless the first core on the CPU die is disabled - yes... real hardware :-D).
Then the job output... first printing some debug info, and then a mini mpirun (using Open MPI & the HWLOC fix). (It would look the same if I used Intel MPI.)
But at this LSF core-binding/affinity level we are not using the physical on-die core numbers, but the logical core numbers. So for a machine with 24 cores: the first core is 0, the last core is 23. So... now back to our MPI job and mpirun:
LSB_AFFINITY_HOSTFILE:
n-62-31-15 20
n-62-31-15 21
n-62-31-15 22
n-62-31-15 23
n-62-31-16 20
n-62-31-16 21
n-62-31-16 22
n-62-31-16 23
n-62-31-7 20
n-62-31-7 21
n-62-31-7 22
n-62-31-7 23
n-62-31-8 0
n-62-31-8 1
n-62-31-8 2
n-62-31-8 3
LSB_DJOB_HOSTFILE:
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-8
n-62-31-8
n-62-31-8
n-62-31-8
LSB_DJOB_RANKFILE:
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-15
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-16
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-7
n-62-31-8
n-62-31-8
n-62-31-8
n-62-31-8
LSB_MCPU_HOSTS:
n-62-31-15 4 n-62-31-16 4 n-62-31-7 4 n-62-31-8 4
export HWLOC_ALLOW=all
processor-core is in column 7.
mpirun --display... --report-bindings hostname -s && sleep 1 && PS_PERSONALITY=sgi ps -F \\$\\$
====================== ALLOCATED NODES ======================
n-62-31-15: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: n-62-31-15
n-62-31-16: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: SLOTS_GIVEN
aliases: NONE
n-62-31-7: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: SLOTS_GIVEN
aliases: NONE
n-62-31-8: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: SLOTS_GIVEN
aliases: NONE
=================================================================
====================== ALLOCATED NODES ======================
n-62-31-15: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: n-62-31-15
n-62-31-16: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
aliases: 10.66.31.16,10.66.85.16
n-62-31-7: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
aliases: 10.66.31.7,10.66.85.7
n-62-31-8: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
aliases: 10.66.31.8,10.66.85.8
=================================================================
====================== ALLOCATED NODES ======================
n-62-31-15: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: n-62-31-15
n-62-31-16: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
aliases: 10.66.31.16,10.66.85.16
n-62-31-7: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
aliases: 10.66.31.7,10.66.85.7
n-62-31-8: slots=4 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:SLOTS_GIVEN
aliases: 10.66.31.8,10.66.85.8
=================================================================
================================= JOB MAP =================================
Data for JOB prterun-n-62-31-15-1999239@1 offset 0 Total slots allocated 16
Mapper requested: rank_file Last mapper: rank_file Mapping policy: BYUSER:NOOVERSUBSCRIBE Ranking policy: BYUSER
Binding policy: CORE:IF-SUPPORTED Cpu set: N/A PPR: N/A Cpus-per-rank: N/A Cpu Type: HWT
Num new daemons: 0 New daemon starting vpid INVALID
Num nodes: 4
Data for node: n-62-31-15 State: 3 Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:MAPPED:SLOTS_GIVEN
resolved from n-62-31-15
Daemon: [prterun-n-62-31-15-1999239@0,0] Daemon launched: True
Num slots: 4 Slots in use: 4 Oversubscribed: FALSE
Num slots allocated: 4 Max slots: 0 Num procs: 4
Data for proc: [prterun-n-62-31-15-1999239@1,0]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 0
State: INITIALIZED App_context: 0
Binding: package[1][hwt:20]
Data for proc: [prterun-n-62-31-15-1999239@1,1]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 1
State: INITIALIZED App_context: 0
Binding: package[1][hwt:21]
Data for proc: [prterun-n-62-31-15-1999239@1,2]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 2
State: INITIALIZED App_context: 0
Binding: package[1][hwt:22]
Data for proc: [prterun-n-62-31-15-1999239@1,3]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 3
State: INITIALIZED App_context: 0
Binding: package[1][hwt:23]
Data for node: n-62-31-16 State: 3 Flags: DAEMON_LAUNCHED:MAPPED:SLOTS_GIVEN
resolved from 10.66.31.16
resolved from 10.66.85.16
Daemon: [prterun-n-62-31-15-1999239@0,1] Daemon launched: True
Num slots: 4 Slots in use: 4 Oversubscribed: FALSE
Num slots allocated: 4 Max slots: 0 Num procs: 4
Data for proc: [prterun-n-62-31-15-1999239@1,4]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 4
State: INITIALIZED App_context: 0
Binding: package[1][hwt:20]
Data for proc: [prterun-n-62-31-15-1999239@1,5]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 5
State: INITIALIZED App_context: 0
Binding: package[1][hwt:21]
Data for proc: [prterun-n-62-31-15-1999239@1,6]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 6
State: INITIALIZED App_context: 0
Binding: package[1][hwt:22]
Data for proc: [prterun-n-62-31-15-1999239@1,7]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 7
State: INITIALIZED App_context: 0
Binding: package[1][hwt:23]
Data for node: n-62-31-7 State: 3 Flags: DAEMON_LAUNCHED:MAPPED:SLOTS_GIVEN
resolved from 10.66.31.7
resolved from 10.66.85.7
Daemon: [prterun-n-62-31-15-1999239@0,2] Daemon launched: True
Num slots: 4 Slots in use: 4 Oversubscribed: FALSE
Num slots allocated: 4 Max slots: 0 Num procs: 4
Data for proc: [prterun-n-62-31-15-1999239@1,8]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 8
State: INITIALIZED App_context: 0
Binding: package[1][hwt:20]
Data for proc: [prterun-n-62-31-15-1999239@1,9]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 9
State: INITIALIZED App_context: 0
Binding: package[1][hwt:21]
Data for proc: [prterun-n-62-31-15-1999239@1,10]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 10
State: INITIALIZED App_context: 0
Binding: package[1][hwt:22]
Data for proc: [prterun-n-62-31-15-1999239@1,11]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 11
State: INITIALIZED App_context: 0
Binding: package[1][hwt:23]
Data for node: n-62-31-8 State: 3 Flags: DAEMON_LAUNCHED:MAPPED:SLOTS_GIVEN
resolved from 10.66.31.8
resolved from 10.66.85.8
Daemon: [prterun-n-62-31-15-1999239@0,3] Daemon launched: True
Num slots: 4 Slots in use: 4 Oversubscribed: FALSE
Num slots allocated: 4 Max slots: 0 Num procs: 4
Data for proc: [prterun-n-62-31-15-1999239@1,12]
Pid: 0 Local rank: 0 Node rank: 0 App rank: 12
State: INITIALIZED App_context: 0
Binding: package[0][hwt:0]
Data for proc: [prterun-n-62-31-15-1999239@1,13]
Pid: 0 Local rank: 1 Node rank: 1 App rank: 13
State: INITIALIZED App_context: 0
Binding: package[0][hwt:1]
Data for proc: [prterun-n-62-31-15-1999239@1,14]
Pid: 0 Local rank: 2 Node rank: 2 App rank: 14
State: INITIALIZED App_context: 0
Binding: package[0][hwt:2]
Data for proc: [prterun-n-62-31-15-1999239@1,15]
Pid: 0 Local rank: 3 Node rank: 3 App rank: 15
State: INITIALIZED App_context: 0
Binding: package[0][hwt:3]
=============================================================
[1,1]<stdout>: sebo 1999245 1999239 3 4179 3432 21 18:59 ? R 0:00 /bin/ps -
[1,3]<stdout>: sebo 1999247 1999239 3 4179 3444 23 18:59 ? R 0:00 /bin/ps -
[1,0]<stdout>: sebo 1999244 1999239 3 4179 3444 20 18:59 ? R 0:00 /bin/ps -
[1,2]<stdout>: sebo 1999246 1999239 3 4179 3428 22 18:59 ? R 0:00 /bin/ps -
[1,10]<stdout>: sebo 2875545 2875526 3 4179 3448 22 18:59 ? R 0:00 /bin/ps -
[1,8]<stdout>: sebo 2875544 2875526 3 4179 3444 20 18:59 ? R 0:00 /bin/ps -
[1,9]<stdout>: sebo 2875546 2875526 3 4179 3428 21 18:59 ? R 0:00 /bin/ps -
[1,6]<stdout>: sebo 2672111 2672092 3 4179 3436 22 18:59 ? R 0:00 /bin/ps -
[1,4]<stdout>: sebo 2672110 2672092 3 4179 3440 20 18:59 ? R 0:00 /bin/ps -
[1,7]<stdout>: sebo 2672112 2672092 3 4179 3432 23 18:59 ? R 0:00 /bin/ps -
[1,11]<stdout>: sebo 2875547 2875526 3 4179 3520 23 18:59 ? R 0:00 /bin/ps -
[1,5]<stdout>: sebo 2672113 2672092 3 4179 3432 21 18:59 ? R 0:00 /bin/ps -
[1,14]<stdout>: sebo 2581432 2581413 3 4179 3400 2 18:59 ? R 0:00 /bin/ps -
[1,13]<stdout>: sebo 2581434 2581413 3 4179 3388 1 18:59 ? R 0:00 /bin/ps -
[1,15]<stdout>: sebo 2581433 2581413 3 4179 3364 3 18:59 ? R 0:00 /bin/ps -
[1,12]<stdout>: sebo 2581431 2581413 3 4179 3364 0 18:59 ? R 0:00 /bin/ps -
So if LSF reports in the affinity hostfile
n-62-31-15 20
that just means that the 20th core on this machine is reserved as the "first logical core" for this LSF job. LSF then creates a cpuset, and LSF expects MPI to respect it.
So....that's as far as I understand this.
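For what it's worth, whether a launcher actually respects that LSF cpuset can be checked from inside any rank without MPI at all; a plain-C sketch (mine) using sched_getaffinity:

```c
/* Sketch: print the CPUs the current process is confined to, i.e. the cpuset
 * the batch system (LSF here) set up and expects launchers to respect. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    cpu_set_t mask;

    if (0 != sched_getaffinity(0, sizeof(mask), &mask)) {
        perror("sched_getaffinity");
        return 1;
    }
    printf("pid %ld allowed on CPUs:", (long) getpid());
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++) {
        if (CPU_ISSET(cpu, &mask)) {
            printf(" %d", cpu);
        }
    }
    printf("\n");
    return 0;
}
```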
And as a "WTF item": Dell numbers everything a bit differently, so here is a Dell node:
$ bhosts -aff n-62-12-75
Host[503.3G] n-62-12-75
Socket0
NUMA[0: 0M / 251.3G]
core0(0)
core8(2)
core1(4)
core9(6)
core2(8)
core10(10)
core3(12)
core11(14)
core4(16)
core12(18)
core5(20)
core13(22)
core6(24)
core14(26)
core7(28)
core15(30)
Socket1
NUMA[1: 0M / 251.9G]
core0(1)
core8(3)
core1(5)
core9(7)
core2(9)
core10(11)
core3(13)
core11(15)
core4(17)
core12(19)
core5(21)
core13(23)
core6(25)
core14(27)
core7(29)
core15(31)
Here is the matching NUMA output:
#numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30
node 0 size: 257401 MB
node 0 free: 121271 MB
node 1 cpus: 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31
node 1 size: 258043 MB
node 1 free: 301 MB
node distances:
node 0 1
0: 10 20
1: 20 10
And with openmpi-5.0.3 everything was still working fine, because it was respecting the cores which LSF was "suggesting" to use.
The only difference I can see between OMPI v5.0.3 and head of PRRTE master is that the OMPI code still used some "overlay" code instead of just directly calling HWLOC functions (e.g., hwloc_get_obj_by_type). However, I don't see that reflected in the rank_file code - it just calls HWLOC functions.
So I'm not sure why the older version was working and the newer one doesn't. Not that much has changed in the affected areas. Would have to dig into the code and follow the proc placement procedure at an atomistic level to try and see a difference, assuming it must exist (but likely is very subtle). As stated above, the basic problem seems to be that we get an unavailable object (i.e., NULL) returned when using the HWLOC function to obtain the physical core object specified in the affinity file, and there is not much I can do about it from there.
So... - this is the "troubling patch":
diff --git a/src/hwloc/hwloc_base_util.c b/src/hwloc/hwloc_base_util.c
index 4a32a7fa1a..055b6dae4b 100644
--- a/src/hwloc/hwloc_base_util.c
+++ b/src/hwloc/hwloc_base_util.c
@@ -1865,12 +1865,18 @@ int prte_hwloc_base_topology_set_flags(hwloc_topology_t topology, unsigned long
{
if (io) {
#if HWLOC_API_VERSION < 0x00020000
+ flags |= HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM;
flags |= HWLOC_TOPOLOGY_FLAG_IO_DEVICES;
#else
int ret = hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
if (0 != ret) {
return ret;
}
+# if HWLOC_API_VERSION < 0x00020100
+ flags |= HWLOC_TOPOLOGY_FLAG_WHOLE_SYSTEM;
+# else
+ flags |= HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED;
+# endif
#endif
}
// Blacklist the "gl" component due to potential conflicts.
We are using a "fresh" hwloc here, so it's the third case for us.
Here is the example just using hwloc:
LSB_AFFINITY_HOSTFILE:
n-62-31-15 20
n-62-31-15 21
n-62-31-15 22
n-62-31-15 23
n-62-31-16 20
n-62-31-16 21
n-62-31-16 22
n-62-31-16 23
n-62-31-7 20
n-62-31-7 21
n-62-31-7 22
n-62-31-7 23
n-62-31-17 0
n-62-31-17 1
n-62-31-17 2
n-62-31-17 3
Loaded dependency [mpi/5.0.3-gcc-14.1.0-binutils-2.42]: gcc/14.1.0-binutils-2.42
Loaded module: mpi/5.0.3-gcc-14.1.0-binutils-2.42
# the good use case with an older MPI which works
# and just gives the "correct" cores.
Loading mpi/5.0.3-gcc-14.1.0-binutils-2.42
Loading requirement: gcc/14.1.0-binutils-2.42
hwloc-info 2.10.0
hwloc-ls --no-io --filter core:important
Machine (377GB total)
Package L#0
NUMANode L#0 (P#0 188GB)
Package L#1
NUMANode L#1 (P#1 189GB)
L3 L#0 (19MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#20)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#21)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#22)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#23)
Loaded dependency [mpi/5.0.5-gcc-14.2.0-binutils-2.43-sebotest1]: gcc/14.2.0-binutils-2.43
Loaded module: mpi/5.0.5-gcc-14.2.0-binutils-2.43-sebotest1
# and this newer mpi with include-disallowed gives cores which are not in LSF's cpuset:
Loading mpi/5.0.5-gcc-14.2.0-binutils-2.43-sebotest1
Loading requirement: gcc/14.2.0-binutils-2.43
hwloc-info 2.11.1
hwloc-ls --no-io --disallowed --filter core:important
Machine (377GB total)
Package L#0
NUMANode L#0 (P#0 188GB)
L3 L#0 (19MB)
L2 L#0 (1024KB) + L1d L#0 (32KB) + L1i L#0 (32KB) + Core L#0 + PU L#0 (P#0)
L2 L#1 (1024KB) + L1d L#1 (32KB) + L1i L#1 (32KB) + Core L#1 + PU L#1 (P#1)
L2 L#2 (1024KB) + L1d L#2 (32KB) + L1i L#2 (32KB) + Core L#2 + PU L#2 (P#2)
L2 L#3 (1024KB) + L1d L#3 (32KB) + L1i L#3 (32KB) + Core L#3 + PU L#3 (P#3)
L2 L#4 (1024KB) + L1d L#4 (32KB) + L1i L#4 (32KB) + Core L#4 + PU L#4 (P#4)
L2 L#5 (1024KB) + L1d L#5 (32KB) + L1i L#5 (32KB) + Core L#5 + PU L#5 (P#5)
L2 L#6 (1024KB) + L1d L#6 (32KB) + L1i L#6 (32KB) + Core L#6 + PU L#6 (P#6)
L2 L#7 (1024KB) + L1d L#7 (32KB) + L1i L#7 (32KB) + Core L#7 + PU L#7 (P#7)
L2 L#8 (1024KB) + L1d L#8 (32KB) + L1i L#8 (32KB) + Core L#8 + PU L#8 (P#8)
L2 L#9 (1024KB) + L1d L#9 (32KB) + L1i L#9 (32KB) + Core L#9 + PU L#9 (P#9)
L2 L#10 (1024KB) + L1d L#10 (32KB) + L1i L#10 (32KB) + Core L#10 + PU L#10 (P#10)
L2 L#11 (1024KB) + L1d L#11 (32KB) + L1i L#11 (32KB) + Core L#11 + PU L#11 (P#11)
Package L#1
NUMANode L#1 (P#1 189GB)
L3 L#1 (19MB)
L2 L#12 (1024KB) + L1d L#12 (32KB) + L1i L#12 (32KB) + Core L#12 + PU L#12 (P#12)
L2 L#13 (1024KB) + L1d L#13 (32KB) + L1i L#13 (32KB) + Core L#13 + PU L#13 (P#13)
L2 L#14 (1024KB) + L1d L#14 (32KB) + L1i L#14 (32KB) + Core L#14 + PU L#14 (P#14)
L2 L#15 (1024KB) + L1d L#15 (32KB) + L1i L#15 (32KB) + Core L#15 + PU L#15 (P#15)
L2 L#16 (1024KB) + L1d L#16 (32KB) + L1i L#16 (32KB) + Core L#16 + PU L#16 (P#16)
L2 L#17 (1024KB) + L1d L#17 (32KB) + L1i L#17 (32KB) + Core L#17 + PU L#17 (P#17)
L2 L#18 (1024KB) + L1d L#18 (32KB) + L1i L#18 (32KB) + Core L#18 + PU L#18 (P#18)
L2 L#19 (1024KB) + L1d L#19 (32KB) + L1i L#19 (32KB) + Core L#19 + PU L#19 (P#19)
L2 L#20 (1024KB) + L1d L#20 (32KB) + L1i L#20 (32KB) + Core L#20 + PU L#20 (P#20)
L2 L#21 (1024KB) + L1d L#21 (32KB) + L1i L#21 (32KB) + Core L#21 + PU L#21 (P#21)
L2 L#22 (1024KB) + L1d L#22 (32KB) + L1i L#22 (32KB) + Core L#22 + PU L#22 (P#22)
L2 L#23 (1024KB) + L1d L#23 (32KB) + L1i L#23 (32KB) + Core L#23 + PU L#23 (P#23)
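The contrast between those two hwloc-ls runs can also be reproduced programmatically; a small sketch (mine, assuming hwloc >= 2.1 where HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED exists) that loads the topology with and without the flag and compares the number of visible PUs:

```c
/* Sketch: compare what the topology contains with and without
 * HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED, mirroring the difference between
 * "hwloc-ls" and "hwloc-ls --disallowed" above. */
#include <hwloc.h>
#include <stdio.h>

static int count_pus(unsigned long flags)
{
    hwloc_topology_t topo;
    int n;

    hwloc_topology_init(&topo);
    hwloc_topology_set_flags(topo, flags);
    hwloc_topology_load(topo);
    n = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_PU);
    hwloc_topology_destroy(topo);
    return n;
}

int main(void)
{
    printf("PUs without INCLUDE_DISALLOWED: %d\n", count_pus(0));
#if HWLOC_API_VERSION >= 0x00020100
    printf("PUs with    INCLUDE_DISALLOWED: %d\n",
           count_pus(HWLOC_TOPOLOGY_FLAG_INCLUDE_DISALLOWED));
#endif
    return 0;
}
```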
I don't think that's quite correct - here is what that section of code looks like in PRRTE master branch:
int prte_hwloc_base_topology_set_flags(hwloc_topology_t topology, unsigned long flags, bool io)
{
if (io) {
#if HWLOC_API_VERSION < 0x00020000
flags |= HWLOC_TOPOLOGY_FLAG_IO_DEVICES;
#else
int ret = hwloc_topology_set_io_types_filter(topology, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
if (0 != ret) {
return ret;
}
#endif
}
// Blacklist the "gl" component due to potential conflicts.
// See "https://github.com/open-mpi/ompi/issues/10025" for
// an explanation
#ifdef HWLOC_VERSION_MAJOR
#if HWLOC_VERSION_MAJOR > 2
hwloc_topology_set_components(topology, HWLOC_TOPOLOGY_COMPONENTS_FLAG_BLACKLIST, "gl");
#elif HWLOC_VERSION_MAJOR == 2 && HWLOC_VERSION_MINOR >= 1
hwloc_topology_set_components(topology, HWLOC_TOPOLOGY_COMPONENTS_FLAG_BLACKLIST, "gl");
#endif
#endif
return hwloc_topology_set_flags(topology, flags);
}
We removed the lines you cite some time ago as we switched to using HWLOC's "allowed cpuset" function. However, you have stated that PRRTE master continues to fail for LSF - which means that there is some other cause.
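For context, a helper like the one above is meant to be called between hwloc_topology_init() and hwloc_topology_load(); a minimal usage sketch (my illustration, assuming hwloc >= 2.0, not how PRRTE actually wires it up):

```c
/* Sketch of where a flags helper like the one above fits into the usual
 * hwloc call sequence: init, configure filters/flags, then load. */
#include <hwloc.h>

int main(void)
{
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);

    /* keep interesting I/O devices, as the helper does when io == true */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_IMPORTANT);
#if defined(HWLOC_VERSION_MAJOR) && \
    (HWLOC_VERSION_MAJOR > 2 || (HWLOC_VERSION_MAJOR == 2 && HWLOC_VERSION_MINOR >= 1))
    /* avoid the "gl" component, per open-mpi/ompi#10025 */
    hwloc_topology_set_components(topo, HWLOC_TOPOLOGY_COMPONENTS_FLAG_BLACKLIST, "gl");
#endif
    hwloc_topology_set_flags(topo, 0);   /* no extra flags, matching current PRRTE */

    hwloc_topology_load(topo);
    /* ... use the topology ... */
    hwloc_topology_destroy(topo);
    return 0;
}
```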
Been some discussion over here about what to do with this given that nobody over here has access to an appropriate system. Best I can determine, the rank_file code is working correctly when given a correct rankfile. The issue therefore seems to lie in the conversion of the LSF affinity file to a PRRTE rankfile. In some situations, when we ask for the HWLOC object corresponding to the LSF physical core ID, we get a NULL return indicating that the specified object is not available to us.
It isn't clear if the problem lies in LSF (either in LSF itself or in your local setup?), in HWLOC (perhaps not correctly parsing the bound topology within the allocation?), or in the way we are using HWLOC (maybe in the flags we pass when reading the topology?). We had a couple of suggestions:
One data point that might help would be to grab an XML output from lstopo from within the allocation and pass that along with the affinity file, so we can see if the specified object is present and perhaps identify why we cannot get it returned by HWLOC.
You might want to open an LSF ticket on this problem - could be there is something problematic in the LSF config.
I don't know if @bgoglin has any other thoughts or suggestions. I'm afraid I'm somewhat stuck at this point.
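If it helps anyone reproduce this offline, checking whether a given affinity-file CPU number exists in a saved topology only takes a few lines of hwloc; a sketch (mine) that loads an lstopo XML export and performs the same lookup the converter does (topology.xml and the index are placeholders):

```c
/* Sketch: load an lstopo XML export and check whether a given OS PU index
 * (as found in LSB_AFFINITY_HOSTFILE) is present in that topology. */
#include <hwloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    if (argc != 3) {
        fprintf(stderr, "usage: %s topology.xml os_pu_index\n", argv[0]);
        return 1;
    }

    hwloc_topology_t topo;
    hwloc_topology_init(&topo);
    if (0 != hwloc_topology_set_xml(topo, argv[1])) {
        fprintf(stderr, "could not read XML topology %s\n", argv[1]);
        return 1;
    }
    hwloc_topology_load(topo);

    unsigned idx = (unsigned) strtoul(argv[2], NULL, 10);
    hwloc_obj_t pu = hwloc_get_pu_obj_by_os_index(topo, idx);
    if (NULL == pu) {
        printf("PU with OS index %u: not present (the NULL case discussed above)\n", idx);
    } else {
        printf("PU with OS index %u: hwloc logical index L#%u\n", idx, pu->logical_index);
    }

    hwloc_topology_destroy(topo);
    return 0;
}
```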
Here are some hwloc-xml-files:
For the "working" variant the cpusets are all the same on all nodes, and the "crashing" variant has (at least?) two different kinds of cpusets.
Ah - well that at least partially explains the problem. The code explicitly states that it assumes a homogeneous system. Been that way from the very beginning. I'll try to find some time to look at it, but make no promises.
Can you give this a try? Apply to PRRTE master branch:
diff --git a/src/mca/rmaps/rank_file/rmaps_rank_file.c b/src/mca/rmaps/rank_file/rmaps_rank_file.c
index d1a2401a41..acaf73aaa6 100644
--- a/src/mca/rmaps/rank_file/rmaps_rank_file.c
+++ b/src/mca/rmaps/rank_file/rmaps_rank_file.c
@@ -756,6 +756,28 @@ static int prte_rmaps_rf_process_lsf_affinity_hostfile(prte_job_t *jdata,
return PRTE_SUCCESS;
}
+static bool quickmatch(prte_node_t *nd, char *name)
+{
+ int n;
+
+ if (0 == strcmp(nd->name, name)) {
+ return true;
+ }
+ if (0 == strcmp(nd->name, prte_process_info.nodename) &&
+ (0 == strcmp(name, "localhost") ||
+ 0 == strcmp(name, "127.0.0.1"))) {
+ return true;
+ }
+ if (NULL != nd->aliases) {
+ for (n=0; NULL != nd->aliases[n]; n++) {
+ if (0 == strcmp(nd->aliases[n], name)) {
+ return true;
+ }
+ }
+ }
+ return false;
+}
+
static int prte_rmaps_rf_lsf_convert_affinity_to_rankfile(char *affinity_file, char **aff_rankfile)
{
FILE *fp;
@@ -765,9 +787,9 @@ static int prte_rmaps_rf_lsf_convert_affinity_to_rankfile(char *affinity_file, c
char *tmp_str = NULL;
size_t len;
char **cpus;
- int i;
+ int i, j;
hwloc_obj_t obj;
- prte_topology_t *my_topo = NULL;
+ prte_node_t *node, *nptr;
if( NULL != *aff_rankfile) {
free(*aff_rankfile);
@@ -835,11 +857,33 @@ static int prte_rmaps_rf_lsf_convert_affinity_to_rankfile(char *affinity_file, c
// Convert the Physical CPU set from LSF to a Hwloc logical CPU set
pmix_output_verbose(20, prte_rmaps_base_framework.framework_output,
"mca:rmaps:rf: (lsf) Convert Physical CPUSET from <%s>", sep);
- my_topo = (prte_topology_t *) pmix_pointer_array_get_item(prte_node_topologies, 0);
+
+ // find the named host
+ nptr = NULL;
+ for (j = 0; j < prte_node_pool->size; j++) {
+ node = (prte_node_t *) pmix_pointer_array_get_item(prte_node_pool, j);
+ if (NULL == node) {
+ continue;
+ }
+ if (quickmatch(node, hstname)) {
+ nptr = node;
+ break;
+ }
+ }
+ if (NULL == nptr) {
+ /* wasn't found - that is an error */
+ pmix_show_help("help-rmaps_rank_file.txt",
+ "resource-not-found", true,
+ hstname);
+ fclose(fp);
+ close(fp_rank);
+ return PRTE_ERROR;
+ }
+
cpus = PMIX_ARGV_SPLIT_COMPAT(sep, ',');
for(i = 0; NULL != cpus[i]; ++i) {
- // assume HNP has the same topology as other nodes
- obj = hwloc_get_pu_obj_by_os_index(my_topo->topo, strtol(cpus[i], NULL, 10)) ;
+ // get the specified object
+ obj = hwloc_get_pu_obj_by_os_index(nptr->topology->topo, strtol(cpus[i], NULL, 10)) ;
if (NULL == obj) {
PMIX_ARGV_FREE_COMPAT(cpus);
fclose(fp);
openpmix 3ecdbf32c5dc77beb066c8683df49648cb920804
prrte bc3c11e76a4928062ada6c423906ab5ad3b758e9
openmpi-5.0.5
It works... thanks a lot... but now I'm crashing in pmix (after a while of running the petsc-stream-benchmark). The output is:
$ cat scaling.log
1 15043.8528 Rate (MB/s)
2 27858.5442 Rate (MB/s) 1.85182
3 42371.2441 Rate (MB/s) 2.81651
4 55388.2854 Rate (MB/s) 3.68178
5 67671.5669 Rate (MB/s) 4.49827
6 71739.3803 Rate (MB/s) 4.76867
7 77762.7681 Rate (MB/s) 5.16906
8 90678.2054 Rate (MB/s) 6.02757
<crash>
Here is one of the stack-traces:
(gdb) info stack
#0 0x00007f011a88b94c in __pthread_kill_implementation () from /lib64/libc.so.6
#1 0x00007f011a83e646 in raise () from /lib64/libc.so.6
#2 0x00007f011a828885 in abort () from /lib64/libc.so.6
#3 0x00007f011a82871b in __assert_fail_base.cold () from /lib64/libc.so.6
#4 0x00007f011a837386 in __assert_fail () from /lib64/libc.so.6
#5 0x00007f011b02098d in pmix_gds_base_store_modex (buff=buff@entry=0x7f011a5eeab0, cb_fn=cb_fn@entry=0x7f011b05b740 <_hash_store_modex>, cbdata=cbdata@entry=0x7f0114030c70) at base/gds_base_fns.c:149
#6 0x00007f011b05b71c in hash_store_modex (buf=0x7f011a5eeab0, cbdata=0x7f0114030c70) at gds_hash.c:1328
#7 0x00007f011ae888fa in _mdxcbfunc (sd=-1, args=args@entry=4, cbdata=0x3acd2440) at server/pmix_server.c:3679
#8 0x00007f011ad61943 in event_process_active_single_queue (base=base@entry=0x3acbb6d0, activeq=activeq@entry=0x3acbbb20, max_to_process=max_to_process@entry=2147483647, endtime=endtime@entry=0x0) at event.c:1691
#9 0x00007f011ad61e8f in event_process_active (base=base@entry=0x3acbb6d0) at event.c:1783
#10 0x00007f011ad628d7 in event_base_loop (base=0x3acbb6d0, flags=flags@entry=1) at event.c:2006
#11 0x00007f011aee8d8c in progress_engine (obj=0x3ad57d18) at runtime/pmix_progress_threads.c:110
#12 0x00007f011a889c02 in start_thread () from /lib64/libc.so.6
#13 0x00007f011a90ec40 in clone3 () from /lib64/libc.so.6
Sorry, I forgot to add the last words:
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive registered command from [prterun-n-62-12-60-83080@0,3]
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for job prterun-n-62-12-60-83080@1
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for vpid 12
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for vpid 13
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive got registered for vpid 14
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:receive done processing commands
[n-62-12-60:83080] [prterun-n-62-12-60-83080@0,0] plm:base:launch prterun-n-62-12-60-83080@1 registered
[n-62-12-60:83080] PMIX ERROR: PMIX_ERR_UNPACK_READ_PAST_END_OF_BUFFER in file base/gds_base_fns.c at line 148
prterun: base/gds_base_fns.c:149: pmix_gds_base_store_modex: Assertion `PMIX_OBJ_MAGIC_ID == ((pmix_object_t *) (&bkt))->obj_magic_id' failed.
No idea what that app does, but odd that it would crash after running for awhile. The referenced operation takes place during MPI_Init. Would have to think a bit about it, but any further info about the run that failed (like what is different relative to the runs that worked) would help.
So... I have also recompiled petsc now. It's the same problem, but I have now also discovered the "bad prefix" messages.
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got local launch complete for vpid 11 state RUNNING
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:launch wiring up iof for job prterun-n-62-12-60-93112@1
[n-62-12-60:93120] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-60:93117] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-60:93119] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-60:93118] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-61:80262] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-61:80260] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-61:80261] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-61:80263] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-63:64744] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-63:64742] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-63:64743] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-63:64745] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-62:88964] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-62:88962] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-62:88965] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-62:88963] mca:base:process_repository_item filename /appl9/gcc/14.2.0-binutils-2.43/openmpi/5.0.5-lsf10-alma92-Z-newprrte2/lib/pmix/ has bad prefix - expected:
pmix_mca_
or
libpmix_mca_
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive processing msg
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive registered command from [prterun-n-62-12-60-93112@0,1]
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for job prterun-n-62-12-60-93112@1
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 4
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 5
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 6
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 7
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive processing msg
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive registered command from [prterun-n-62-12-60-93112@0,3]
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for job prterun-n-62-12-60-93112@1
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 12
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 13
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 14
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 15
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive processing msg
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive registered command from [prterun-n-62-12-60-93112@0,2]
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for job prterun-n-62-12-60-93112@1
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 8
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 9
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 10
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive got registered for vpid 11
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:receive done processing commands
[n-62-12-60:93112] [prterun-n-62-12-60-93112@0,0] plm:base:launch prterun-n-62-12-60-93112@1 registered
[n-62-12-60:93112] PMIX ERROR: PMIX_ERR_UNPACK_READ_PAST_END_OF_BUFFER in file base/gds_base_fns.c at line 148
prterun: base/gds_base_fns.c:149: pmix_gds_base_store_modex: Assertion `PMIX_OBJ_MAGIC_ID == ((pmix_object_t *) (&bkt))->obj_magic_id' failed.
------------------------------------------------
See graph in the file src/benchmarks/streams/MPIscaling.png
I think I have to re-do the stuff in a clean way again, just to make sure I haven't created some non-debuggable-mess.
I have now patched prrte-3.0.6 with your patch, using pmix-5.0.3 and openmpi-5.0.5, and MPI across "different nodes" works without crashing. So the above PMIX unpack error seems to be unrelated to this (LSF-related) problem. So maybe a Slurm user can give this one a try?
Not sure what it would have to do with Slurm, but I agree it is also unlikely to relate to LSF either. Will follow up with the change to PRRTE. Thanks for the assist!
Background information
What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)
I am testing this on 5.0.3 vs. 5.0.5 (only 5.0.5 has this problem). I don't have 5.0.4 installed, so I don't know if that is affected, but I am quite confident that this also occurs for 5.0.4 (since the submodule for prrte is the same for 5.0.4 and 5.0.5).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From the sources. A little bit of ompi_info -c info:
And env-vars:
Version numbers are of course different for 5.0.5, otherwise the same.
Please describe the system on which you are running
Operating system/version:
Alma Linux 9.4
Computer hardware:
Tested on various hardware, both with and without hardware threads (see below).
Network type: Not relevant, I think.
Details of the problem
The problem relates to the interaction between LSF and OpenMPI.
A couple of issues are shown here.
Bug introduced between 5.0.3 and 5.0.5
I encounter problems running simple programs (hello-world) in a multinode configuration:
This will run on 4 nodes, each using 2 cores.
Output from 5.0.3:
This looks reasonable. And the LSF affinity file corresponds to this binding.
Note that these nodes do not have hyper-threading enabled. So our guess is that LSF always expresses the affinity in HWTs, which is OK. It still obeys the default core binding, which is what our end-users would expect.
5.0.5
Clearly something went wrong when parsing the affinity hostfile.
The hostfile looks like this (for both 5.0.3 and 5.0.5):
(different job, hence different nodes/ranks)
So the above indicates some regression in this handling. I tried to backtrack through prrte, but I am not skilled enough to follow the logic happening there.
I tracked the prrte submodule hashes of Open MPI 5.0.3, 5.0.4, and 5.0.5 to these:
3a70fac9a21700b31c4a9f9958afa207a627f0fa
b68a0acb32cfc0d3c19249e5514820555bcf438b
b68a0acb32cfc0d3c19249e5514820555bcf438b
So my suspicion is that 5.0.4 also has this.
Now, these things are relatively easily fixed.
I just do unset LSB_AFFINITY_HOSTFILE and rely on cgroups. Then I get the correct behaviour - correct bindings etc.
By unsetting it, I also fall back to the default OpenMPI binding:
5.0.3
Note here that it says core instead of hwt.
5.0.5
So the same thing happens, good!
Nodes with HW threads
This is likely related to the above, I just put it here for completeness.
As mentioned above, I can do unset LSB_AFFINITY_HOSTFILE and get correct bindings. However, the above works only when there are no HWTs.
Here is the same thing for a node with 2 HWT/core (EPYC Milan, 32 cores/socket in a 2-socket system).
Only requesting 4 cores here.
5.0.3
This looks OK. Still binding to the cgroup cores.
5.0.5
This looks bad - wrong core binding; it should have been 6,7 on both nodes.
If you need more information, let me know!