open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

[v5.0.x] 3rd-party: bump prrte,pmix submodules #12626

Closed wenduwan closed 3 months ago

wenduwan commented 3 months ago

bot:notacherrypick

rhc54 commented 3 months ago

Do you want to have it execute all the mpi4py tests? Don't know if they will pass, but might be a good data point.
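
(For anyone wanting a quick local check against this build: mpi4py ships a small benchmark entry point that exercises basic init and communication; the full CI coverage comes from mpi4py's own test suite in its source tree, so this is only a smoke test.)

# Minimal smoke test, assuming mpi4py is installed against the freshly built Open MPI
mpiexec -n 4 python -m mpi4py.bench helloworld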

rhc54 commented 3 months ago

Is there a way to get a stacktrace from that failure? Looks like a new failure mode to me - list item being multiply released.
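
(A common way to capture one, assuming core dumps are enabled on the node; the binary and core-file names below are illustrative and depend on the system's core_pattern.)

# Allow core dumps, reproduce the crash, then open the core in gdb and print the backtrace
ulimit -c unlimited
mpirun -n 2 ./reproducer            # whatever command triggers the failure
gdb /path/to/crashed/binary core    # then at the (gdb) prompt, run: bt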

hppritcha commented 3 months ago

This looks like something I fixed in prrte master a while back to get singleton spawn working again.

rhc54 commented 3 months ago

Hmmm...I dug back and found a commit in Feb that did a few things, one of which is relevant here. If you want to "pull" PRRTE v3.0 again, we can see if that fixes it.

rhc54 commented 3 months ago

Looks like it is failing somewhere else now? I can't tell, I'm afraid - let me know what you think.

rhc54 commented 3 months ago

The NVIDIA failure is in the infrastructure - failed to pull something. I believe the other failures are ones you've already been tracking. So I guess the question is: is this good enough for you? The comm_spawn related issues are fixed in OMPI main (combination of OMPI and PRRTE, IIRC), but I don't expect to bring the PRRTE changes back to the v3.0 release branch as it would take too much effort to unwind the conflicts. If someone else wants to tackle it, I'm happy to look at the PR.

Otherwise, this is probably as good as it is going to get. The next PMIx release will launch the v6 series, and the PRRTE v4 series will premiere at the same time - but not likely until late this year.

wenduwan commented 3 months ago

@hppritcha I saw some failures in our CI. Checking now.

wenduwan commented 3 months ago

Test failure is real. I need to take a closer look.

rhc54 commented 3 months ago

Bah - I see the problem. Missed a couple of critical cherry-picks. Fix coming.

rhc54 commented 3 months ago

Okay, pull PMIx again - sorry for the noise

rhc54 commented 3 months ago

okay, pull it again - found another missing commit.

rhc54 commented 3 months ago

Hooray! That looks much better 😄

hppritcha commented 3 months ago

@wenduwan i will cherry-pick https://github.com/open-mpi/ompi/pull/12465 back to 5.0.x once you've merged this PR.

rhc54 commented 3 months ago

Anyone have a reason not to declare this combination of PMIx/PRRTE as good for release? Or do you need to do some more checking?

wenduwan commented 3 months ago

@rhc54 We typically allow at least 1 week for community testing. Are you planning changes to the release branches?

janjust commented 3 months ago

I know that in the past several releases we were not pinned against the latest/greatest pmix/prrte - but if this sha fixes all or most of our issues, I don't see a reason not to mark it as a release and cut our RC from it.

rhc54 commented 3 months ago

I won't be making any changes to those branches unless a problem surfaces that requires it. If this works for your release (when you are ready to do it), I'll simply tag and release at this point.

hppritcha commented 3 months ago

it turns out the cid intercomm fixes are already in 5.0.x - see https://github.com/open-mpi/ompi/pull/12490

wenduwan commented 3 months ago

@rhc54 Thanks. We will likely let you know next week, when we cut the first release candidate.

wenduwan commented 3 months ago

I still see a failure in our CI after merging this change. I need to investigate separately.

wenduwan commented 3 months ago

I looked at a segfault and got this backtrace:

(gdb) bt
#0  0x0000ffff92405ea0 in hwloc_bitmap_not () from /lib64/libhwloc.so.5
#1  0x0000ffff92406128 in hwloc_bitmap_list_snprintf () from /lib64/libhwloc.so.5
#2  0x0000ffff924062d4 in hwloc_bitmap_list_asprintf () from /lib64/libhwloc.so.5
#3  0x0000ffff92af5b04 in bind_generic (jdata=0x1efd6e50, proc=0x1efd9bf0, node=0x1eea84a0, obj=0x0, options=0xfffff1f11d00)
    at base/rmaps_base_binding.c:132
#4  0x0000ffff92af677c in prte_rmaps_base_bind_proc (jdata=0x1efd6e50, proc=0x1efd9bf0, node=0x1eea84a0, obj=0x0, options=0xfffff1f11d00)
    at base/rmaps_base_binding.c:448
#5  0x0000ffff92af215c in prte_rmaps_base_setup_proc (jdata=0x1efd6e50, idx=0, node=0x1eea84a0, obj=0x0, options=0xfffff1f11d00)
    at base/rmaps_base_support_fns.c:568
#6  0x0000ffff922e41e0 in ppr_mapper (jdata=0x1efd6e50, options=0xfffff1f11d00) at rmaps_ppr.c:230
#7  0x0000ffff92aec38c in prte_rmaps_base_map_job (fd=-1, args=4, cbdata=0x1efc96c0) at base/rmaps_base_map_job.c:839
#8  0x0000ffff9272d8f8 in event_base_loop () from /lib64/libevent_core-2.0.so.5
#9  0x0000000000409304 in main (argc=14, argv=0xfffff1f12d78) at prte.c:1245

wenduwan commented 3 months ago

The segfault can be fixed by reverting this commit in prrte 🤔

@rhc54 Does it ring a bell?

rhc54 commented 3 months ago

Not really - afraid I need to know something about what you are doing. What is the cmd line? Anything unusual about the topology? Can you send the topology along?

BTW: reverting that commit just recreates the error that it fixed, so that isn't an option (we'll just have AMD complaining).

wenduwan commented 3 months ago

The segfault happens on ARM machines (single-socket, 64 cores). I can reproduce it on either 1 or 2 nodes.

Example for a single-node test (Debian 10):

mpirun --wdir . -n 64 /home/admin/osu-micro-benchmarks/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr

rhc54 commented 3 months ago

Sigh...ARM is a core type, not a node topology, and "single socket" tells me nothing about the chip. I can't do anything without knowing the topology.

wenduwan commented 3 months ago

Would lstopo output be helpful?

rhc54 commented 3 months ago

Yes - please pass along the xml output.
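
(For reference, assuming the hwloc utilities are installed on the node, the XML export can be produced like this; the output file name is arbitrary.)

# Dump the node topology to XML - lstopo infers the format from the .xml extension
lstopo topo.xml

# Or write XML to stdout explicitly
lstopo --of xml > topo.xml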

wenduwan commented 3 months ago

topo.patch

GH doesn't allow uploading .xml files so I changed it to .patch. Please let me know if you can open it.

rhc54 commented 3 months ago

I'm afraid the cmd line you gave me couldn't possibly have resulted in that stack trace - it does not activate the ppr mapper (it uses the round-robin mapper instead). It also runs just fine on the given topology.

Please double-check the cmd line.

wenduwan commented 3 months ago

Sorry, I should have added --map-by ppr:64:node. I had to redact other information.
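
(Putting the two comments together, the full failing invocation would presumably be:)

mpirun --wdir . --map-by ppr:64:node -n 64 /home/admin/osu-micro-benchmarks/install/libexec/osu-micro-benchmarks/mpi/pt2pt/osu_mbw_mr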

rhc54 commented 3 months ago

Tried that - it works fine. Is there some other thing that might influence the result?

rhc54 commented 3 months ago

I should say - I tried that on PRRTE master and it works fine. However, the mapping framework on master is identical to that on the v3.0 branch, so the result should carry over.

wenduwan commented 3 months ago

@rhc54 I wonder which pmix build you tested with?

wenduwan commented 3 months ago

Meanwhile, let me also check the hwloc version on the failing systems.

wenduwan commented 3 months ago

So far the failure happens on Amazon Linux 2 and Debian 10, ARM only.

For Debian 10, hwloc is very old (1.11.12-3): https://packages.debian.org/buster/hwloc

On AL2 it is 1.11.8

I have a hypothesis that the hwloc version plays a role. The prrte commit also tampered with the hwloc version check.
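
(One way to check is along these lines; note that prte may pull in libhwloc indirectly through libprrte, so ldd shows which library is actually resolved at runtime.)

# Version of the hwloc utilities packaged on the node
lstopo --version

# Which libhwloc the runtime actually loads
ldd $(which prte) | grep hwloc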

rhc54 commented 3 months ago

That would indeed make a difference. I can play with the older stuff, but it will have to wait for a bit - any hwloc that old becomes a very low priority.

"The prrte commit also tampered with the hwloc version check."

Afraid that isn't correct - we haven't done anything with the hwloc version in ages.

wenduwan commented 3 months ago

I also got input from the team that the failure can be reproduced on Intel but not AMD.

janjust commented 3 months ago

@rhc54 We'll revert prrte to v3.0.5 to unblock AWS CI and leave pmix as-is (whatever the latest bump is). If you get around to triaging the issue, great - we'll happily bump it up again - but in the meantime we'll revert to unblock the CIs.
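
(A rough sketch of that revert, assuming the usual 3rd-party/prrte submodule layout in this repo; the exact workflow and commit message are up to whoever opens the PR.)

# Pin the prrte submodule back to the v3.0.5 tag
cd 3rd-party/prrte
git fetch --tags origin
git checkout v3.0.5
cd ../..
git add 3rd-party/prrte
git commit -m "3rd-party: revert prrte submodule to v3.0.5"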

rhc54 commented 3 months ago

@wenduwan Unfortunately, the topology you provided was generated using HWLOC v2.0 or greater and cannot be read by an earlier version of HWLOC. Can you please provide me with a topology generated by the actual HWLOC version you want to use?

rhc54 commented 3 months ago

FWIW: I had to do some minor fixes to allow PRRTE master branch to build against HWLOC 1.11.8, but after that was done it ran just fine using a topology generated with 1.11.8. It came from an Intel-based node, but I tried various mapping modes and all worked just fine.

Will need to see yours to verify, and will also check the v3.0 branch in case something is missing.

wenduwan commented 3 months ago

@rhc54 Thanks for looking into this. I attached the topology output generated with hwloc v1.

topo_v1.patch

rhc54 commented 3 months ago

So the problem isn't in the mapper or really with PRRTE - the problem is that pre-Stone Age HWLOC doesn't see any NUMA regions on the node, and we default to binding to NUMA. More current HWLOC does see a NUMA region, and so it has no problem with the topology.

What I have to figure out is: what should we do in this situation? Since we were not given an explicit binding directive, I could (and perhaps should) simply not bind at all. Or I could default to something like bind-to-core in the absence of a NUMA region.

Any thoughts?

janjust commented 3 months ago

Bind to none would be my suggestion. Binding to core in the presence of threads would lead to core-sharing and potentially unexpected performance issues.
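
(If a user-level workaround is needed in the meantime, binding can be disabled explicitly on the command line, which may sidestep the crash since the binding step is skipped; the benchmark path is abbreviated here.)

mpirun --bind-to none --map-by ppr:64:node -n 64 ./osu_mbw_mr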

rhc54 commented 3 months ago

Hmmm...just confirmed that you cannot build the PRRTE v3.0 branch with pre-Stone Age HWLOC (v1.11.x).

@wenduwan Can you please confirm that your team configured PRRTE against a modern HWLOC and then ran it using the older version?? If that's the case, then I'll need to create a runtime check to ensure we don't do that as it will definitely lead to trouble. There are a lot of `#if` protections in the code that will take the incorrect branch in that scenario.
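
(To illustrate the mismatch being described - HWLOC_API_VERSION is baked in at compile time while hwloc_get_api_version() reports what the loaded library provides - a runtime guard could look roughly like the sketch below; this is not the actual PRRTE code.)

#include <stdio.h>
#include <stdlib.h>
#include <hwloc.h>

/* Sketch only: refuse to run if the major hwloc API version we were compiled
 * against differs from the one actually loaded at runtime, since the 1.x/2.x
 * split changes struct layouts and which #if branches were compiled in. */
static void check_hwloc_runtime(void)
{
    unsigned runtime = hwloc_get_api_version();

    if ((runtime >> 16) != (HWLOC_API_VERSION >> 16)) {
        fprintf(stderr, "hwloc mismatch: compiled against 0x%x, running 0x%x\n",
                HWLOC_API_VERSION, runtime);
        exit(1);
    }
}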

wenduwan commented 3 months ago

@rhc54 We have been building ompi against hwloc v1.11.x and running with it for the last couple of years, and more recently prrte/pmix as well. Did you have any trouble building prrte?

rhc54 commented 3 months ago

Head of master and head of v3.0 branch both don't compile against pre-Stone Age HWLOC, but I fixed the breakage. With the fixes, I can reproduce the problem - but as noted above, the problem lies in HWLOC failing to identify a NUMA node. I'm looking into why that is happening and can obviously protect the code against it - just have to decide on the best alternative solution. Probably will just "bind to none" since it wasn't specified.

rhc54 commented 3 months ago

Probably need the runtime check anyway as someone might just do that (build against HWLOC v2.x and run against v1.y), which will definitely cause segfaults. Will need to give it some thought - OMPI is similarly vulnerable, FWIW. Guess we just haven't had anyone do that yet?

rhc54 commented 3 months ago

Confirmed - pre-Stone Age HWLOC doesn't identify a NUMA object in the topology, while v2.x does, and that explains the difference in behavior. Looks like we used to just default to not binding in that situation, so I'll just add some checks to protect PRRTE in this circumstance and (a) error out if the user specified anything other than bind-to core/thread, and (b) default to not binding if the user didn't specify anything.
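
(A rough sketch of that guard, with illustrative names rather than the real PRRTE symbols: if the topology exposes no NUMA object, reject explicit binding requests coarser than core/thread and otherwise fall back to not binding at all. HWLOC_OBJ_NUMANODE is the modern name; very old hwloc called it HWLOC_OBJ_NODE.)

#include <hwloc.h>

/* Illustrative only - the real rmaps code differs. */
static int choose_default_binding(hwloc_topology_t topo,
                                  int user_specified_binding,
                                  int bind_level_is_core_or_thread)
{
    int numa_count = hwloc_get_nbobjs_by_type(topo, HWLOC_OBJ_NUMANODE);

    if (numa_count <= 0) {
        if (user_specified_binding && !bind_level_is_core_or_thread)
            return -1;   /* error out: requested object type is not in the topology */
        return 0;        /* default: do not bind */
    }
    return 1;            /* NUMA present: keep the usual bind-to-NUMA default */
}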

rhc54 commented 3 months ago

Okay, PRRTE v3.0 branch has been updated with the fix. Please bump it and see how it does.

wenduwan commented 3 months ago

@rhc54 Thank you very much for the quick turnaround!