openmpi-5.0.5 won't spawn

sukanka commented 1 week ago

Background information

There may be a regression in the Open MPI 5.0 series.

What version of Open MPI are you using? (e.g., v4.1.6, v5.0.1, git branch name and hash, etc.)

v5.0.5, but in fact this regression has been there since 5.0.1

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From archlinux repo

Please describe the system on which you are running

Operating system/version: Arch Linux x86_64 Linux 6.11.6-zen1-1-zen
Computer hardware: Laptop with AMD Ryzen 7 8845H w/ Radeon 780M Graphics (16) @ 5.10 GHz and NVIDIA GeForce RTX 4070 Max-Q / Mobile
Network type: Wired

Details of the problem

The MWE provided in https://github.com/open-mpi/ompi/issues/11749#issuecomment-1591610547 does not work with openmpi 5.0, but it does work with 4.1.6.

hppritcha commented 1 week ago

Would it be possible to test using the Open MPI main branch? Also, you may need to disable some collective components:

export OMPI_MCA_coll=^han,hcoll

They seem to get confused trying to handle the intercomm_merge call in the test case in #11749. Are you using the mpi4py test suite to observe this problem? If so, could you post the output from

python3 ./main.py -v -k

?

rhc54 commented 1 week ago

FWIW: it works fine for me if you disable those components. Thanks @hppritcha !

rhc54 commented 1 week ago

Should have added - it also works fine if you run it using mpirun instead of as a singleton, even with those components active. Seems to be something about singleton operations and those components?

ggouaillardet commented 1 week ago

@sukanka can you confirm you are testing the C program? Also, are you running in singleton mode (e.g. ./a.out) or via mpirun? mpirun works for me too.

[gilles@arch ~]$ pacman -Q openmpi openpmix
openmpi 5.0.5-2
openpmix 5.0.3-1
[gilles@arch ~]$ mpicc -o testspawn testspawn.c 
[gilles@arch ~]$ mpirun -np 1 ./testspawn
Hello, I am rank 1 in the merged comm
Hello, I am rank 3 in the merged comm
Hello, I am rank 2 in the merged comm
Hello, I am rank 0 in the merged comm
[gilles@arch ~]$

rhc54 commented 1 week ago

FWIW: quick check indicates that it is hanging in the MPI_Barrier across the intercomm. Removing that line, or disabling han,hcoll, resolves the problem.

rhc54 commented 1 week ago

@ggouaillardet You only see the problem in singleton mode.

ggouaillardet commented 1 week ago

@rhc54 I only see the problem in singleton mode (and with coll/han), but wanted to confirm with the reporter.

fwiw, in singleton mode, the workaround (credit is yours) is

$ OMPI_MCA_coll=^han,hcoll ./a.out

rhc54 commented 1 week ago

Actually, @hppritcha came up with that workaround.

You might want to check that the singleton is finding the modex info for the child procs - I'm guessing that it doesn't and hangs in the modex_recv call. But that is just a (somewhat educated) guess. Would, however, explain why it all works when run under mpirun. Not sure what the coll/han components have to do with it - maybe something about how they call modex_recv can lead them to block?

rhc54 commented 1 week ago

Ohhhh...you know what? That singleton "pushes" its connection info during MPI_Init, which since it was not started by mpirun, means that the connection info goes nowhere. It then kicks off a background mpirun to shepherd the child job - but that mpirun never gets the singleton's connection info! So now when we try to pass the connection info around, it is missing the singleton's connection info.

Thus, it is likely that the singleton "knows" how to connect to the child job - but the child job has no connection info for the singleton parent. Might be something worth checking.

sukanka commented 1 week ago

> @sukanka can you confirm you are testing the C program? also, are you running in singleton mode (e.g. ./a.out) or via mpirun? mpirun works for me too.

Yeah, I'm testing the C program and running in singleton mode. (mpirun works for me.)

➜  test ./a.out
Hello, I am rank 0 in the merged comm
Hello, I am rank 2 in the merged comm
Hello, I am rank 3 in the merged comm
Hello, I am rank 1 in the merged comm
# stuck here

openmpi 5.0.5-2
openpmix 5.0.3-1

And the workaround OMPI_MCA_coll=^han,hcoll ./a.out works too.
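
For mpi4py users, the same workaround can be applied from inside the script, as long as the variable is in the environment before MPI is initialized (by default mpi4py initializes MPI when `mpi4py.MPI` is imported). A minimal sketch; the helper name `disable_coll_components` is made up for illustration and is not part of Open MPI or mpi4py:

```python
import os


def disable_coll_components(env, components=("han", "hcoll")):
    """Return a copy of env with OMPI_MCA_coll set to exclude the given components.

    Hypothetical helper: it only builds the same "^han,hcoll" value that the
    shell workaround in this thread passes on the command line.
    """
    updated = dict(env)
    updated["OMPI_MCA_coll"] = "^" + ",".join(components)
    return updated


if __name__ == "__main__":
    # Apply the setting to this process before MPI is initialized.
    os.environ.update(disable_coll_components(os.environ))
    # Only now import MPI, e.g.:  from mpi4py import MPI
    print(os.environ["OMPI_MCA_coll"])
```

Later comments in this thread report that excluding han alone was enough on a build without hcoll, in which case `components=("han",)` would be the narrower choice. Either way this is the stopgap discussed here, not a fix.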

rhc54 commented 1 week ago

> Ohhhh...you know what? That singleton "pushes" its connection info during MPI_Init, which since it was not started by mpirun, means that the connection info goes nowhere. It then kicks off a background mpirun to shepherd the child job - but that mpirun never gets the singleton's connection info! So now when we try to pass the connection info around, it is missing the singleton's connection info.
>
> Thus, it is likely that the singleton "knows" how to connect to the child job - but the child job has no connection info for the singleton parent. Might be something worth checking.

Well, I stand corrected - there is a call to PMIx_Commit in the dpm code that pushes the connection info up to the DVM after it is fork/exec'd. So I have nothing further I can contribute - looks like something in han/hcoll.

bosilca commented 1 week ago

That fix makes little sense. I'm not saying it does not work, just that it might look like it addresses the bug, but that's not what it does. Put simply, hcoll can only work on special hardware setups, and both of these collective components disable themselves for intercomms (check the comm_query).

hppritcha commented 1 week ago

Well George, I recommended that because in the gdb traceback I saw a bunch of ranks blocked in some kind of hcoll calls, probably while doing an allreduce for a CID as part of the merge operation. It's likely the user doesn't even have that installed (which is probably a good thing).

rhc54 commented 1 week ago

I know nothing about the coll system any more, but FWIW everything runs fine if I simply remove the MPI_Barrier call. It seems that the barrier "hangs" if one of the intercomm members is a singleton - but works fine if none of the members are singletons.

I saw no problems getting thru the intercomm merge operation.

hppritcha commented 1 week ago

I'll take a look into this.

hppritcha commented 1 week ago

gdb says a lot (can reproduce with a single child process)

CHILD PROCESS

(gdb) bt
#0  0x00007ffff291f0aa in uct_rc_mlx5_iface_progress_cyclic () from /lib64/ucx/libuct_ib.so.0
#1  0x00007ffff66c829a in ucp_worker_progress () from /lib64/libucp.so.0
#2  0x00007ffff79b6d51 in mca_pml_ucx_progress () at pml_ucx.c:588
#3  0x00007ffff6d07ba5 in opal_progress () at runtime/opal_progress.c:224 
#4  0x00007ffff76ce22f in ompi_request_wait_completion (req=0x135efd8) at ../ompi/request/request.h:493
#5  0x00007ffff76ce298 in ompi_request_default_wait (req_ptr=0x7fffffffcc30, status=0x7fffffffcc10) at request/req_wait.c:40
#6  0x00007ffff779e2ed in ompi_coll_base_sendrecv_actual (sendbuf=0xe7cc20, scount=1, sdatatype=0x7ffff7d4d3c0 <ompi_mpi_int>, dest=0, stag=-12, 
    recvbuf=0x7fffffffcf3c, rcount=1, rdatatype=0x7ffff7d4d3c0 <ompi_mpi_int>, source=0, rtag=-12, comm=0x923670, status=0x0)
    at base/coll_base_util.c:66
#7  0x00007ffff77a0e97 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0x1, rbuf=0x7fffffffcf3c, count=1, 
    dtype=0x7ffff7d4d3c0 <ompi_mpi_int>, op=0x7ffff7d85180 <ompi_mpi_op_max>, comm=0x923670, module=0x135f0d0) at base/coll_base_allreduce.c:223
#8  0x00007ffff77a120d in ompi_coll_base_allreduce_intra_ring (sbuf=0x1, rbuf=0x7fffffffcf3c, count=1, dtype=0x7ffff7d4d3c0 <ompi_mpi_int>, 
    op=0x7ffff7d85180 <ompi_mpi_op_max>, comm=0x923670, module=0x135f0d0) at base/coll_base_allreduce.c:377
#9  0x00007ffff7826d83 in ompi_coll_tuned_allreduce_intra_do_this (sbuf=0x1, rbuf=0x7fffffffcf3c, count=1, dtype=0x7ffff7d4d3c0 <ompi_mpi_int>, 
    op=0x7ffff7d85180 <ompi_mpi_op_max>, comm=0x923670, module=0x135f0d0, algorithm=4, faninout=0, segsize=0)
    at coll_tuned_allreduce_decision.c:145
#10 0x00007ffff781d9ae in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0x1, rbuf=0x7fffffffcf3c, count=1, dtype=0x7ffff7d4d3c0 <ompi_mpi_int>, 
    op=0x7ffff7d85180 <ompi_mpi_op_max>, comm=0x923670, module=0x135f0d0) at coll_tuned_decision_fixed.c:216
#11 0x00007ffff77edac8 in mca_coll_han_comm_create_new (comm=0x923670, han_module=0x7a4d00) at coll_han_subcomms.c:105
#12 0x00007ffff77ca956 in mca_coll_han_barrier_intra_simple (comm=0x923670, module=0x7a4d00) at coll_han_barrier.c:37
#13 0x00007ffff77e8d27 in mca_coll_han_barrier_intra_dynamic (comm=0x923670, module=0x7a4d00) at coll_han_dynamic.c:815
#14 0x00007ffff76f4963 in PMPI_Barrier (comm=0x923670) at barrier.c:76
#15 0x0000000000400b8f in main ()

PARENT PROCESS
#0  0x00007ffff79ac1bc in opal_atomic_add_fetch_32 (addr=0x7ffff7d76868 <ompi_part_persist+136>, value=1)
    at ../../../../opal/include/opal/sys/gcc_builtin/atomic.h:252
#1  0x00007ffff79ad26a in mca_part_persist_progress () at ../../../../ompi/mca/part/persist/part_persist.h:166
#2  0x00007ffff6d07ba5 in opal_progress () at runtime/opal_progress.c:224
#3  0x00007ffff76ce22f in ompi_request_wait_completion (req=0x135ad98) at ../ompi/request/request.h:493
#4  0x00007ffff76ce298 in ompi_request_default_wait (req_ptr=0x7fffffffd6f0, status=0x7fffffffd6d0) at request/req_wait.c:40
#5  0x00007ffff77ab85d in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=1, rtag=-16, comm=0xbf0010) at base/coll_base_barrier.c:64
#6  0x00007ffff77abd3b in ompi_coll_base_barrier_intra_recursivedoubling (comm=0xbf0010, module=0x135b080) at base/coll_base_barrier.c:235
#7  0x00007ffff78281a4 in ompi_coll_tuned_barrier_intra_do_this (comm=0xbf0010, module=0x135b080, algorithm=3, faninout=0, segsize=0)
    at coll_tuned_barrier_decision.c:101
#8  0x00007ffff781e14d in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0xbf0010, module=0x135b080) at coll_tuned_decision_fixed.c:500
#9  0x00007ffff76f4963 in PMPI_Barrier (comm=0xbf0010) at barrier.c:76

hppritcha commented 1 week ago

I rebuilt OMPI main without hcoll support and verified that disabling han alone is enough to avoid this "confusion about which MCA coll component to use" problem. It seems to occur when a singleton launch is combined with certain collectives on communicators that involve both parent and child processes.

bosilca commented 1 week ago

I see what's going on. First, according to the code I found in #11749, the MPI_Barrier is not on the intercomm but on an intracomm, so having han or hcoll there would make sense.

Except ...

HAN, and I assume HCOLL, are disabled in the singleton, because there is a single process so we don't need any fancy collectives. On the children they are enabled because by that point there is more than one process, so HAN and/or hcoll make sense. That's why disabling them fixes the hang: it forces the children to use tuned, which matches the algorithm selected on the parent. Let me fiddle a little with HAN initialization to find a way to address this.

bosilca commented 6 days ago

My assumption above was correct, however the root cause was not. Basically, the two groups have different knowledge about each other: the original group correctly identified the spawned processes as local and therefore disabled HAN. The spawned processes, however, seem to have no knowledge about the location of the original processes and assume they are not local, so HAN makes sense to them. At the first collective communication their selection logic diverges: one group uses tuned and the other han, with a guaranteed deadlock.

As a result, all the solutions proposed in this thread are incorrect: disabling some collective components is a bandaid, not a real solution. The real solution is to make sure the knowledge about process locations is symmetric between the parent and the spawned group.
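
The divergence described above can be illustrated with a toy model. This is only a sketch under a simplifying assumption: the rule "pick tuned when every peer is believed to be local, otherwise han" stands in for the real component selection, which is more involved, and the function name is hypothetical.

```python
def pick_barrier_component(peer_is_local):
    """Toy selection rule (an assumption, not Open MPI's actual logic):
    'tuned' if every peer is believed to be on the local node, else 'han'."""
    return "tuned" if all(peer_is_local) else "han"


# The singleton parent correctly sees the spawned children as local.
parent_view = [True, True, True]
# A spawned child wrongly believes the parent (first entry) is not local.
child_view = [False, True, True]

parent_choice = pick_barrier_component(parent_view)
child_choice = pick_barrier_component(child_view)

# The two sides enter the same barrier with different algorithms.
print(parent_choice, child_choice)  # tuned han
print("mismatch:", parent_choice != child_choice)  # mismatch: True
```

With symmetric locality information both views would be all-local and both sides would make the same choice, which is the fix argued for above.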

rhc54 commented 6 days ago

Would appreciate a little help understanding the problem. Are you saying that the core issue is that the spawned procs are getting an incorrect response to this request:

            OPAL_MODEX_RECV_VALUE_OPTIONAL(rc, PMIX_LOCAL_PEERS,
                                           &wildcard_rank, &val, PMIX_STRING);

when asking about the parent procs?

bosilca commented 6 days ago

I don't know, I did not look into that particular aspect. What I noticed is that when a child process looks up the location of the parent process, which at that point is part of the proc_t struct, it does not get the right answer. Basically, OPAL_PROC_ON_LOCAL_NODE returns false when the parent is indeed on the same node.

rhc54 commented 6 days ago

Okay, I can take a peek at some point. If it is in PMIx or PRRTE, the fix won't be available until late this year or early next year as I just did the final release in the current series for those projects. Probably means any eventual fix in OMPI won't be available until OMPI v6 appears.

rhc54 commented 6 days ago

We should probably go ahead and close this as "will not fix" since it won't be fixed in the OMPI v5 series.

Just to clarify George's comment, this problem only exists if the parent process is a singleton. It notably does not exist if you start the parent process with mpirun, or maybe even if you execute the parent as a singleton in the presence of a PRRTE DVM. Not entirely sure why at this time - need to check that second option a bit more.

So @sukanka, if you cannot wait until sometime next year for OMPI v6, your best bet is to simply use one of the known good methods for starting the parent. Alternatively, the fix will eventually appear in either or both of PMIx and PRRTE (assuming it isn't ultimately a problem in OMPI itself), and you can then build against an external updated version of them - but that won't happen until late this year or early next year.

sukanka commented 5 days ago

Thank you all for the answers. I'd like to switch to a good method before the fix appears.

rhc54 commented 5 days ago

> I'd like to switch to a good method before the fix appears.

Just start your parent process with mpirun -n 1 ./foo and you'll be fine

rhc54 commented 5 days ago

Did some further digging into this and found a solution. Good and bad news. Changes are relatively minor, but it unfortunately requires changes in all three projects - OMPI, PMIx, and PRRTE. I have filed PRs accordingly:

https://github.com/openpmix/openpmix/pull/3445 https://github.com/openpmix/prrte/pull/2070 https://github.com/open-mpi/ompi/pull/12920

No idea on when those changes might appear in releases, but I would guess not for awhile. I am working on a little more aesthetically pleasing alternative fix (will coexist with the above as there is no harm in having both methods), but that won't appear for another week or two (additional changes should be confined to just PMIx and PRRTE).

sukanka commented 5 days ago

> Did some further digging into this and found a solution. Good and bad news. Changes are relatively minor, but it unfortunately requires changes in all three projects - OMPI, PMIx, and PRRTE. I have filed PRs accordingly:
>
> openpmix/openpmix#3445 openpmix/prrte#2070 #12920
>
> No idea on when those changes might appear in releases, but I would guess not for awhile. I am working on a little more aesthetically pleasing alternative fix (will coexist with the above as there is no harm in having both methods), but that won't appear for another week or two (additional changes should be confined to just PMIx and PRRTE).

Thanks a lot! These patches work. I just rebuilt openmpi, openpmix and prrte with the patches above, and the MWEs (both the C and Python versions) work now.

I will file a bug report against the Arch Linux packages once the final fix is ready, so I don't have to wait for openmpi-6.0.

> Just start your parent process with mpirun -n 1 ./foo and you'll be fine

BTW, how can I achieve this with mpi4py (the example script in https://github.com/open-mpi/ompi/issues/11749#issue-1750042919)? In the YADE project, we just use mpi4py.

rhc54 commented 5 days ago

> BTW, how can I achieve this with mpi4py (The example script in https://github.com/open-mpi/ompi/issues/11749#issue-1750042919)? As in the YADE project, we just use mpi4py.

Just do mpirun -n 1 python3 script.py
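
If users tend to start the script directly, the script could re-execute itself under mpirun before MPI is initialized. A rough sketch under stated assumptions: it relies on Open MPI's mpirun exporting `OMPI_COMM_WORLD_SIZE` to the processes it launches and assumes that variable is absent for a singleton. The function names and the `SPAWN_DEMO_RELAUNCHED` marker are hypothetical, introduced only to prevent a relaunch loop.

```python
import os
import sys

# Hypothetical marker variable (not an Open MPI setting) guarding against loops.
RELAUNCH_MARKER = "SPAWN_DEMO_RELAUNCHED"


def launched_by_mpirun(env):
    """Heuristic: treat the presence of OMPI_COMM_WORLD_SIZE as 'started by mpirun'."""
    return "OMPI_COMM_WORLD_SIZE" in env


def mpirun_argv(script_argv, nprocs=1, python=sys.executable):
    """Build the command line for: mpirun -n <nprocs> <python> <script> [args...]"""
    return ["mpirun", "-n", str(nprocs), python, *script_argv]


def relaunch_under_mpirun_if_singleton(env=os.environ, script_argv=None):
    """Re-exec the script under mpirun unless it already runs under mpirun."""
    if launched_by_mpirun(env) or env.get(RELAUNCH_MARKER) == "1":
        return
    argv = mpirun_argv(sys.argv if script_argv is None else script_argv)
    new_env = dict(env, **{RELAUNCH_MARKER: "1"})
    # Replaces the current process on success; raises OSError if mpirun is missing.
    os.execvpe(argv[0], argv, new_env)


# Show the command that would be executed for a script named script.py.
print(mpirun_argv(["script.py"], python="python3"))
# ['mpirun', '-n', '1', 'python3', 'script.py']
```

To use it, call `relaunch_under_mpirun_if_singleton()` at the very top of the script, before `from mpi4py import MPI`.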
