dalcinl opened 10 months ago
@dalcinl Are you using Open MPI v5.0.2?
@janjust Last time I tried it was ompi@master. At this point I'm losing track of all the accumulated issues with their respective branches.
@dalcinl thanks for putting this together. Just to check: you are not trying to run oversubscribed with these failures, correct?
Just to check: you are not trying to run oversubscribed with these failures
Oh, hold on... Yes, I may eventually run oversubscribed if the tests spawn too many processes. But I'm setting the proper MCA parameters to allow for that. Am I missing something? Also, see above: I'm reporting failures even in singleton init mode, and in that case I don't think I'm oversubscribing the machine. Also note that the deadlocks are not always reproducible, so any potential issue with oversubscription does not seem to be perfectly reproducible either.
I can repeat my local tests tomorrow with current main and then report the outcome.
Folks, I've updated the description. All my local tests are with ompi@main.
@janjust My CI also failed with deadlock using ompi@v5.0.x, see here.
I think I have a fix in prrte for the singleton problem:
python test/test_spawn.py -v
Odd that you don't see the assert that I observed.
I'm having problems reproducing this one:
mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot
Could you do a run with
mpiexec --display allocation
so I can try to better reproduce it?
@hppritcha This is what I get from mpiexec --display allocation ...
====================== ALLOCATED NODES ======================
kw61149: slots=1 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
aliases: kw61149
=================================================================
====================== ALLOCATED NODES ======================
kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: kw61149
=================================================================
====================== ALLOCATED NODES ======================
kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
aliases: kw61149
=================================================================
I could possibly help, but I can do nothing without at least some description of what these tests do. Just naming a test in somebody's test suite and saying "it doesn't work" isn't very helpful for those of us not dedicated to using that test suite.
Ultimately, I don't really care - if I can help, then I will. But if you'd rather not provide the info, then that's okay too - Howard can struggle on his own.
@rhc54 I'll submit a trivial reproducer here as soon as I can. The issue is not particular to my test suite; any spawn example in singleton init mode with a relocated ompi install tree should suffice (issue: setting OPAL_PREFIX is not enough, PATH has to be updated as well for spawn to succeed).
I'm not concerned about that one - @hppritcha indicates he is already addressing it. I'm talking about the other ones you cite here.
@hppritcha FWIW: I'm reworking dpm.c to use PMIx_Group instead of the flakier "publish/lookup" handshake. No idea how that will impact these issues - as I have no idea what these issues are đŸ˜„
Sorry, I mixed up issues; I was talking about #12349. Regarding spawn test suites, what mine does that probably no other one does is issue spawn/spawn_multiple calls in rapid succession from both COMM_SELF and COMM_WORLD, asking for a lot of short-lived child processes, maybe heavily oversubscribing the machine, and including things like spawn arguments that are only relevant at the root process. The failures smell like race conditions to me. Maybe the key to the issue is the flaky "publish/lookup" handshake you mentioned above. Your update may very well fix things for good.
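For instance, the testArgsOnlyAtRoot case mentioned earlier relies on the MPI rule that the command/argv/maxprocs/info arguments of MPI_Comm_spawn are significant only at the root. Roughly this shape, as a sketch (not the actual mpi4py code; the helper name is made up):

#include <mpi.h>
#include <stddef.h>

/* Sketch only: non-root ranks pass placeholder arguments, which the MPI
 * standard says are ignored (significant only at root). */
void spawn_args_only_at_root(const char *cmd, int maxnp, MPI_Comm *intercomm)
{
  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  if (0 == rank)
    MPI_Comm_spawn(cmd, MPI_ARGV_NULL, maxnp, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE);
  else /* command/argv/maxprocs are ignored at non-root ranks */
    MPI_Comm_spawn(NULL, NULL, 0, MPI_INFO_NULL,
                   0, MPI_COMM_WORLD, intercomm, MPI_ERRCODES_IGNORE);
}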
@rhc54 @hppritcha Here you have a C reproducer, as simple as it can get.
#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int maxnp = argc >= 2 ? atoi(argv[1]) : 1;
  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Comm comm;
  MPI_Comm_get_parent(&comm);
  if (MPI_COMM_NULL == comm) {
    for (int i = 0; i < 100; i++) {
      if (0 == rank) printf("%d\n", i);
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0,
                     MPI_COMM_SELF, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);
    }
  } else {
    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);
  }

  MPI_Finalize();
  return 0;
}
Build and run as shown below. Other failing cases can be generated by changing the -np argument to mpiexec and the command-line argument to the program. You can also change SELF to WORLD in the C code above (see the snippet after the output below for that variant).
$ mpicc spawn.c
$ mpiexec -n 10 ./a.out 1
0
1
2
...
11
[kw61149:1737636] *** Process received signal ***
[kw61149:1737636] Signal: Segmentation fault (11)
[kw61149:1737636] Signal code: Address not mapped (1)
[kw61149:1737636] Failing at address: 0x180
[kw61149:1737636] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7f053345c9a0]
[kw61149:1737636] [ 1] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x386b4a)[0x7f0533986b4a]
[kw61149:1737636] [ 2] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7f0533988d32]
[kw61149:1737636] [ 3] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0xc76c5)[0x7f05333a26c5]
[kw61149:1737636] [ 4] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(+0x1c645)[0x7f0533c5b645]
[kw61149:1737636] [ 5] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7f0533c5bccf]
[kw61149:1737636] [ 6] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0x23ef1)[0x7f05332feef1]
[kw61149:1737636] [ 7] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7f05332fefaa]
[kw61149:1737636] [ 8] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7f053334aa1a]
[kw61149:1737636] [ 9] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x6c4bf)[0x7f053366c4bf]
[kw61149:1737636] [10] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7f053366e0cf]
[kw61149:1737636] [11] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7f053367ffca]
[kw61149:1737636] [12] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7f0533689df8]
[kw61149:1737636] [13] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7f05336abd47]
[kw61149:1737636] [14] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(PMPI_Init_thread+0xb6)[0x7f0533713c24]
[kw61149:1737636] [15] ./a.out[0x4011f6]
[kw61149:1737636] [16] /lib64/libc.so.6(+0x2814a)[0x7f053344614a]
[kw61149:1737636] [17] /lib64/libc.so.6(__libc_start_main+0x8b)[0x7f053344620b]
[kw61149:1737636] [18] ./a.out[0x4010e5]
[kw61149:1737636] *** End of error message ***
[kw61149:00000] *** An error occurred in Socket closed
[kw61149:00000] *** reported by process [1000800257,5]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] *** and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
[kw61149:1737659] OPAL ERROR: Server not available in file ../../ompi/ompi/dpm/dpm.c at line 406
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
LAUNCHER JOB OBJECT NOT FOUND
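To be explicit about the SELF -> WORLD variant mentioned above: only the parent's spawn call changes, roughly as in the sketch below. Note that with COMM_WORLD the spawn is collective over all parents, so each iteration creates a single child job of maxnp processes instead of one child job per parent rank.

      /* loop body of the parent branch, COMM_WORLD variant (sketch) */
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0,
                     MPI_COMM_WORLD, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);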
I can somewhat reproduce this - it either hangs at the very end, or it segfaults somewhere before the end. Always happens in that "next_cid" code. Afraid that code is deep voodoo and I have no idea what it is doing, or why.
My rewrite gets rid of all that stuff, but it will take me awhile to complete it. I've also been asked to leave the old code in parallel, so I'll add an MCA param to select between the two methods.
I'll add an MCA param to select between the two methods.
I hope your new method will become the default... The old code is evidently broken.
@rhc54 After your diagnosis, what would you suggest for the mpi4py test suite? Should I just skip all these spawn tests as known failures, at least until your new implementation is available to all of us?
You might as well skip them - they will just keep failing for now, and I doubt anyone will take the time to try and work thru that code in OMPI to figure out the problem.
FWIW: in fairness, the existing code seems to work fine when not heavily stressed, as we haven't heard complaints from the field. Not saying there's anything wrong with your test - it is technically correct and we therefore should pass it. Just noting that the test is higher-stress than we see in practice.
We skipped all the connect/accept/spawn tests for years, that's why we are in this mess.
Let's be fair here - we actually do run connect/accept/spawn tests, but they are low-stress versions. The only reason this one fails is because it is a high-stress test with a tight loop over comm-spawn. The current code works just fine for people who actually use it, which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).
Modifying it to support high-stress operations isn't trivial, but probably doable. I concur with other comments, though, that this isn't a high priority.
which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).
I have had quite a lot of email over the years asking about spawn-related issues from Python folks using mpi4py. That's the reason I stress-test implementations within my test suite.
Modifying it to support high-stress operations isn't trivial,
I'm not even sure how high-stress is precisely defined. How could I modify my tests to be lower-stress? I've already bracketed my spawn calls with barriers, but that's clearly not enough. Should I use sleep() or something like that? Should I serialize all spawn calls from COMM_SELF? I'm really afraid that if I stop testing spawn functionality and don't keep an eye on it, at some point it will become simply unusable.
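For example, would serializing the COMM_SELF spawns along these lines count as lower stress? Just a sketch of what I could change in the tests (the helper name is made up; the spawned children behave as in the reproducer above, i.e. barrier on the parent comm, then disconnect):

#include <mpi.h>

/* Sketch: only one parent rank spawns at a time; a world barrier
 * separates the turns, so spawn calls never overlap. */
void serialized_self_spawn(const char *cmd, int maxnp)
{
  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  for (int turn = 0; turn < size; turn++) {
    if (turn == rank) {
      MPI_Comm child;
      MPI_Comm_spawn(cmd, MPI_ARGV_NULL, maxnp, MPI_INFO_NULL,
                     0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
      MPI_Barrier(child);
      MPI_Comm_disconnect(&child);
    }
    MPI_Barrier(MPI_COMM_WORLD); /* ensure this turn's spawn completed before the next starts */
  }
}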
This storyline (nobody uses this feature) is getting old. I have had similar contacts as @dalcinl: people tried to use it, but it was broken, so they found another way.
I have no idea what low-stress and high-stress testing could be. It works or it doesn't.
Sigh - seems a rather pointless debate, doesn't it? Fact is, nobody in the OMPI community has historically been inclined to spend time worrying about it, so the debate is rather moot.
Kudos to @hppritcha for trying to take it on, or at least portions of it (e.g., the singleton comm-spawn case).
Looking at the last 2 years of updates in the DPM-related code, many of us (you/ICM/LANL/Amazon/UTK) tried to do so. Smaller steps, but it got to a point where it kind of works. The only thing left is a solution that makes it work everywhere, because this is a critical feature outside the HPC market.
Agreed - my "low stress" was simply a single invocation of "comm_spawn" by a process in a job. "High-stress" is when a process in a job sits there and calls "comm_spawn" in a loop. It fills the system with lots of coupled jobs, requires that the system (both MPI and RTE) be able to fully clean up/recover between jobs, etc.
We have historically been satisfied with making the "low stress" operation work. Occasionally, I'd take a crack at the "loop spawn" test and at least keep it from hanging, but it was always a struggle and didn't last for very long. And my "loop spawn" test was very simple, just looping over spawn and disconnecting. Never actually had the conjoined jobs do anything.
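For reference, my "loop spawn" test was roughly this shape (a sketch, not the actual test code); unlike the reproducer above there is no barrier or any other traffic on the intercommunicator:

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Comm parent, child;
  MPI_Init(&argc, &argv);
  MPI_Comm_get_parent(&parent);
  if (MPI_COMM_NULL == parent) {
    /* parent: spawn and immediately disconnect, over and over */
    for (int i = 0; i < 100; i++) {
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                     0, MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
      MPI_Comm_disconnect(&child);
    }
  } else {
    /* child: no communication, just disconnect and exit */
    MPI_Comm_disconnect(&parent);
  }
  MPI_Finalize();
  return 0;
}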
I agree that it is an important feature outside HPC, and it obviously should work. Perhaps my new approach will succeed where we previously failed - have to wait and see.
There are remaining issues related to spawn when running the mpi4py testsuite. I'm able to reproduce them locally.
First, you need to switch to the testing/ompi-dpm branch, otherwise some of the reproducers below will be skipped as known failures. I'm configuring ompi@main the following way:
I've enabled oversubscription via both Open MPI and PRTE config files.
Afterwards, try the following:
1) I cannot run in singleton mode:
2) The following test fails when using a large number of MPI processes, let's say 10; you may need more:
Sometimes I get a segfault, sometimes a deadlock, and occasionally the run completes successfully.
The following narrowed-down test may help figure out the problem:
It may run OK many times, but eventually I get a failure and the following output:
This other narrowed-down test also has issues, but it does not always fail:
3) The following test deadlocks when running with 4 or more MPI processes:
It may run occasionally, but most of the time it deadlocks.
cc @hppritcha