open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

mpi4py: Remaining spawn/accept/connect issues #12307

Open dalcinl opened 10 months ago

dalcinl commented 10 months ago

There are remaining issues related to spawn when running the mpi4py testsuite. I'm able to reproduce them locally.

First, you need to switch to the testing/ompi-dpm branch, otherwise some of the reproducers below will be skipped as known failures.

cd mpi4py # git repo clone
git fetch && git checkout testing/ompi-dpm

I'm configuring ompi@main the following way:

options=(
    --prefix=/home/devel/mpi/openmpi/dev
    --without-ofi
    --without-ucx
    --without-psm2
    --without-cuda
    --without-rocm
    --with-pmix=internal
    --with-prrte=internal
    --with-libevent=internal
    --with-hwloc=internal
    --enable-debug
    --enable-mem-debug
    --disable-man-pages
    --disable-sphinx
)
./configure "${options[@]}"

I've enabled oversubscription via both Open MPI and PRTE config files.

$ cat ~/.openmpi/mca-params.conf 
rmaps_default_mapping_policy = :oversubscribe
$ cat ~/.prte/mca-params.conf 
rmaps_default_mapping_policy = :oversubscribe

Afterwards, try the following:

1) I cannot run in singleton mode:

$ python test/test_spawn.py -v
[kw61149:525865] shmem: mmap: an error occurred while determining whether or not /tmp/ompi.kw61149.1000/jf.0/3608084480/shared_mem_cuda_pool.kw61149 could be created.
[kw61149:525865] create_and_attach: unable to create shared memory BTL coordinating structure :: size 134217728 
[0@kw61149] Python 3.12.1 (/usr/bin/python)
[0@kw61149] numpy 1.26.3 (/home/dalcinl/.local/lib/python3.12/site-packages/numpy)
[0@kw61149] MPI 3.1 (Open MPI 5.1.0)
[0@kw61149] mpi4py 4.0.0.dev0 (/home/dalcinl/Devel/mpi4py/src/mpi4py)
testArgsBad (__main__.TestSpawnMultipleSelf.testArgsBad) ... ok
testArgsOnlyAtRoot (__main__.TestSpawnMultipleSelf.testArgsOnlyAtRoot) ... ok
testCommSpawn (__main__.TestSpawnMultipleSelf.testCommSpawn) ... ok
testCommSpawnDefaults1 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults1) ... prte: ../../../../../ompi/3rd-party/openpmix/src/class/pmix_list.c:62: pmix_list_item_destruct: Assertion `0 == item->pmix_list_item_refcount' failed.
ERROR
testCommSpawnDefaults2 (__main__.TestSpawnMultipleSelf.testCommSpawnDefaults2) ... ERROR
...

2) The following test fails when using a large number of MPI processes, say 10 (you may need more):

mpiexec -n 10 python test/test_spawn.py -v

Sometimes I get a segfault, sometimes a deadlock, and occasionally the run completes successfully.

The following narrowed-down test may help pinpoint the problem:

mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot

It may run OK many times, but eventually I get a failure and the following output:

testArgsOnlyAtRoot (__main__.TestSpawnSingleSelfMany.testArgsOnlyAtRoot) ... [kw61149:00000] *** An error occurred in Socket closed

This other narrowed-down test also has issues, but it does not always fail:

mpiexec -n 10 python test/test_spawn.py -v -k testNoArgs
[kw61149:1826801] *** Process received signal ***
[kw61149:1826801] Signal: Segmentation fault (11)
[kw61149:1826801] Signal code: Address not mapped (1)
[kw61149:1826801] Failing at address: 0x180
[kw61149:1826801] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7fea10eaa9a0]
[kw61149:1826801] [ 1] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x386b4a)[0x7fea02786b4a]
[kw61149:1826801] [ 2] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7fea02788d32]
[kw61149:1826801] [ 3] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0xc7661)[0x7fea02384661]
[kw61149:1826801] [ 4] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(+0x1c645)[0x7fea02ea6645]
[kw61149:1826801] [ 5] /home/devel/mpi/openmpi/dev/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7fea02ea6ccf]
[kw61149:1826801] [ 6] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(+0x23ef1)[0x7fea022e0ef1]
[kw61149:1826801] [ 7] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7fea022e0faa]
[kw61149:1826801] [ 8] /home/devel/mpi/openmpi/dev/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7fea0232ca1a]
[kw61149:1826801] [ 9] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(+0x6c4bf)[0x7fea0246c4bf]
[kw61149:1826801] [10] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7fea0246e0cf]
[kw61149:1826801] [11] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7fea0247ffca]
[kw61149:1826801] [12] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7fea02489df8]
[kw61149:1826801] [13] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7fea024abd47]
[kw61149:1826801] [14] /home/devel/mpi/openmpi/dev/lib/libmpi.so.0(PMPI_Init_thread+0xdc)[0x7fea02513c4a]

3) The following test deadlocks when running in 4 or more MPI processes:

mpiexec -n 4 python test/test_dynproc.py -v

It may occasionally run to completion, but most of the time it deadlocks; a minimal sketch of the accept/connect pattern the test exercises follows after the output below.

[kw61149:00000] *** reported by process [3119841281,6]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] ***    and MPI will try to terminate your MPI job as well)
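
For reference, test/test_dynproc.py essentially exercises a port-based accept/connect handshake. Below is a minimal C sketch of that pattern, just my rough equivalent for illustration (not the actual test code; it assumes at least 2 processes):

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  int rank, size;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  /* split COMM_WORLD in two halves: the first half accepts, the second connects */
  int color = (rank < size / 2) ? 0 : 1;
  MPI_Comm half;
  MPI_Comm_split(MPI_COMM_WORLD, color, rank, &half);

  char port[MPI_MAX_PORT_NAME] = {0};
  MPI_Comm inter;

  if (0 == color) {
    if (0 == rank) MPI_Open_port(MPI_INFO_NULL, port);
    /* hand the port name to everyone via rank 0 of COMM_WORLD */
    MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Comm_accept(port, MPI_INFO_NULL, 0, half, &inter);
    if (0 == rank) MPI_Close_port(port);
  } else {
    MPI_Bcast(port, MPI_MAX_PORT_NAME, MPI_CHAR, 0, MPI_COMM_WORLD);
    MPI_Comm_connect(port, MPI_INFO_NULL, 0, half, &inter);
  }

  MPI_Barrier(inter);
  MPI_Comm_disconnect(&inter);
  MPI_Comm_free(&half);
  MPI_Finalize();
  return 0;
}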

cc @hppritcha

janjust commented 9 months ago

@dalcinl Are you using Open MPI v5.0.2?

dalcinl commented 9 months ago

@janjust Last time I tried it was ompi@master. At this point I'm losing track of all the accumulated issues with their respective branches.

hppritcha commented 9 months ago

@dalcinl thanks for putting this together. Just to check: you are not trying to run oversubscribed in these failing cases, correct?

dalcinl commented 9 months ago

Just to check: you are not trying to run oversubscribed in these failing cases

Oh, hold on... Yes, I may eventually run oversubscribed if the tests spawn too many processes. But I'm setting the proper MCA parameters to allow for that. Am I missing something? Also, see above: I'm reporting failures even in singleton init mode, and in that case I don't think I'm oversubscribing the machine. Also note that the deadlocks are not always reproducible, so any potential issue with oversubscription does not seem to be deterministic.

dalcinl commented 9 months ago

I can repeat my local tests tomorrow with current main and then report the outcome.

dalcinl commented 9 months ago

Folks, I've updated the description. All my local tests are with ompi@main.

@janjust My CI also failed with deadlock using ompi@v5.0.x, see here.

hppritcha commented 9 months ago

I think I have a fix in prrte for the singleton problem:

python test/test_spawn.py -v

Odd that you don't see the assert that I observed.

I'm having problems reproducing this one:

mpiexec -n 10 python test/test_spawn.py -v -k testArgsOnlyAtRoot

Could you do a run with

mpiexec --display allocation

so I can better try to reproduce?

dalcinl commented 9 months ago

@hppritcha This is what I get from mpiexec --display allocation ...

======================   ALLOCATED NODES   ======================
    kw61149: slots=1 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED
    aliases: kw61149
=================================================================

======================   ALLOCATED NODES   ======================
    kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: kw61149
=================================================================

======================   ALLOCATED NODES   ======================
    kw61149: slots=64 max_slots=0 slots_inuse=0 state=UP
    Flags: DAEMON_LAUNCHED:LOCATION_VERIFIED:SLOTS_GIVEN
    aliases: kw61149
=================================================================

rhc54 commented 9 months ago

I could possibly help, but I can do nothing without at least some description of what these tests do. Just naming a test in somebody's test suite and saying "it doesn't work" isn't very helpful for those of us not dedicated to using that test suite.

Ultimately, I don't really care - if I can help, then I will. But if you'd rather not provide the info, then that's okay too - Howard can struggle on his own.

dalcinl commented 9 months ago

@rhc54 I'll submit a trivial reproducer here as soon as I can. The issue is not particular to my test suite; any spawn example in singleton init mode with a relocated ompi install tree should suffice (issue: setting OPAL_PREFIX is not enough, PATH has to be updated as well for spawn to succeed).
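
For illustration, a singleton-mode spawn test can be as small as the sketch below (just a sketch of the pattern, not the reproducer promised above); build with mpicc and run it directly as ./a.out, with no mpiexec involved:

#include <mpi.h>

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (MPI_COMM_NULL == parent) {
    /* launched directly (singleton init): spawn a single child of ourselves */
    MPI_Comm child;
    MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                   MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
    MPI_Comm_disconnect(&child);
  } else {
    /* spawned child: just disconnect from the parent and exit */
    MPI_Comm_disconnect(&parent);
  }

  MPI_Finalize();
  return 0;
}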

rhc54 commented 9 months ago

I'm not concerned about that one - @hppritcha indicates he is already addressing it. I'm talking about the other ones you cite here.

@hppritcha FWIW: I'm reworking dpm.c to use PMIx_Group instead of the flakier "publish/lookup" handshake. No idea how that will impact these issues - as I have no idea what these issues are 😄

dalcinl commented 9 months ago

Sorry, I mixed up issues; I was talking about #12349. Regarding spawn test suites, what mine does that probably no other one does is issue spawn/spawn_multiple calls in rapid succession from both COMM_SELF and COMM_WORLD, asking for a lot of short-lived child processes, possibly oversubscribing the machine heavily, and testing things like spawn arguments that are only relevant at the root process. The failures smell like race conditions to me. Maybe the key to the issue is the flaky "publish/lookup" handshake you mentioned above. Your update may very well fix things for good.

dalcinl commented 9 months ago

@rhc54 @hppritcha Here is a C reproducer, as simple as it can get.

#include <mpi.h>
#include <stdlib.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
  int maxnp = argc >= 2 ? atoi(argv[1]) : 1;

  int provided;
  MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);

  int rank;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  MPI_Comm comm;
  MPI_Comm_get_parent(&comm);

  if (MPI_COMM_NULL == comm) {
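    /* parent: repeatedly spawn a short-lived child, sync with it, and disconnect */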

    for (int i=0; i<100; i++) {
      if (0 == rank) printf("%d\n",i);
      MPI_Comm_spawn(argv[0], MPI_ARGV_NULL, maxnp,
                     MPI_INFO_NULL, 0,
                     MPI_COMM_SELF, &comm,
                     MPI_ERRCODES_IGNORE);
      MPI_Barrier(comm);
      MPI_Comm_disconnect(&comm);
    }

  } else {
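    /* spawned child: sync with the parent and disconnect */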

    MPI_Barrier(comm);
    MPI_Comm_disconnect(&comm);

  }

  MPI_Finalize();
  return 0;
}

Build and run as shown below. Other failing cases can be generated by changing the -n argument to mpiexec and the command-line argument to the program. You can also change SELF -> WORLD in the C code above.

$ mpicc spawn.c

$ mpiexec -n 10 ./a.out 1
0
1
2
...
11
[kw61149:1737636] *** Process received signal ***
[kw61149:1737636] Signal: Segmentation fault (11)
[kw61149:1737636] Signal code: Address not mapped (1)
[kw61149:1737636] Failing at address: 0x180
[kw61149:1737636] [ 0] /lib64/libc.so.6(+0x3e9a0)[0x7f053345c9a0]
[kw61149:1737636] [ 1] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x386b4a)[0x7f0533986b4a]
[kw61149:1737636] [ 2] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(mca_pml_ob1_recv_frag_callback_match+0x123)[0x7f0533988d32]
[kw61149:1737636] [ 3] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0xc76c5)[0x7f05333a26c5]
[kw61149:1737636] [ 4] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(+0x1c645)[0x7f0533c5b645]
[kw61149:1737636] [ 5] /home/devel/mpi/openmpi/main/lib/libevent_core-2.1.so.7(event_base_loop+0x47f)[0x7f0533c5bccf]
[kw61149:1737636] [ 6] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(+0x23ef1)[0x7f05332feef1]
[kw61149:1737636] [ 7] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(opal_progress+0xa7)[0x7f05332fefaa]
[kw61149:1737636] [ 8] /home/devel/mpi/openmpi/main/lib/libopen-pal.so.0(ompi_sync_wait_mt+0x1cd)[0x7f053334aa1a]
[kw61149:1737636] [ 9] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(+0x6c4bf)[0x7f053366c4bf]
[kw61149:1737636] [10] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_comm_nextcid+0x7a)[0x7f053366e0cf]
[kw61149:1737636] [11] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_connect_accept+0x3cd6)[0x7f053367ffca]
[kw61149:1737636] [12] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_dpm_dyn_init+0xc6)[0x7f0533689df8]
[kw61149:1737636] [13] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(ompi_mpi_init+0x750)[0x7f05336abd47]
[kw61149:1737636] [14] /home/devel/mpi/openmpi/main/lib/libmpi.so.0(PMPI_Init_thread+0xb6)[0x7f0533713c24]
[kw61149:1737636] [15] ./a.out[0x4011f6]
[kw61149:1737636] [16] /lib64/libc.so.6(+0x2814a)[0x7f053344614a]
[kw61149:1737636] [17] /lib64/libc.so.6(__libc_start_main+0x8b)[0x7f053344620b]
[kw61149:1737636] [18] ./a.out[0x4010e5]
[kw61149:1737636] *** End of error message ***
[kw61149:00000] *** An error occurred in Socket closed
[kw61149:00000] *** reported by process [1000800257,5]
[kw61149:00000] *** on a NULL communicator
[kw61149:00000] *** Unknown error
[kw61149:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[kw61149:00000] ***    and MPI will try to terminate your MPI job as well)
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
[kw61149:1737659] OPAL ERROR: Server not available in file ../../ompi/ompi/dpm/dpm.c at line 406
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
    This help section is empty because PRRTE was built without Sphinx.
--------------------------------------------------------------------------
LAUNCHER JOB OBJECT NOT FOUND

rhc54 commented 9 months ago

I can somewhat reproduce this - it either hangs at the very end, or it segfaults somewhere before the end. Always happens in that "next_cid" code. Afraid that code is deep voodoo and I have no idea what it is doing, or why.

My rewrite gets rid of all that stuff, but it will take me a while to complete it. I've also been asked to leave the old code in parallel, so I'll add an MCA param to select between the two methods.

dalcinl commented 9 months ago

I'll add an MCA param to select between the two methods.

I hope your new method will become the default... The old code is evidently broken.

dalcinl commented 9 months ago

@rhc54 After your diagnosis, what would you suggest for the mpi4py test suite? Should I just skip all these spawn tests as known failures, at least until we all have your new implementation available?

rhc54 commented 9 months ago

You might as well skip them - they will just keep failing for now, and I doubt anyone will take the time to work through that code in OMPI to figure out the problem.

rhc54 commented 9 months ago

FWIW: in fairness, the existing code seems to work fine when not heavily stressed, as we haven't heard complaints from the field. Not saying there's anything wrong with your test - it is technically correct and we therefore should pass it. Just noting that the test is higher-stress than what we see in practice.

bosilca commented 9 months ago

We skipped all the connect/accept/spawn tests for years, that's why we are in this mess.

rhc54 commented 9 months ago

Let's be fair here - we actually do run connect/accept/spawn tests, but they are low-stress versions. The only reason this one fails is that it is a high-stress test with a tight loop over comm-spawn. The current code works just fine for people who actually use it, which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).

Modifying it to support high-stress operations isn't trivial, but probably doable. I concur with other comments, though, that this isn't a high priority.

dalcinl commented 9 months ago

which is why we don't see complaints (well, plus of course the fact that hardly anyone uses comm-spawn).

I have received quite a few emails over the years from Python folks using mpi4py asking about spawn-related issues. That's the reason I stress-test implementations in my test suite.

Modifying it to support high-stress operations isn't trivial,

I'm not even sure how high-stress is precisely defined. How could I modify my tests to be lower-stress? I've already bracketed my spawn calls with barriers, but that's clearly not enough. Should I use sleep() or something like that? Should I serialize all spawn calls from COMM_SELF? I'm really afraid that if I stop testing spawn functionality and don't keep an eye on it, at some point it will become simply unusable.
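
For instance, one way I could imagine serializing the COMM_SELF spawns is to pass a token around COMM_WORLD so that only one rank spawns at a time. Just a sketch of that idea (the serialized_self_spawn helper is hypothetical, not something my test suite currently does):

#include <mpi.h>

/* let ranks take turns doing a COMM_SELF spawn, one rank at a time */
static void serialized_self_spawn(const char *cmd)
{
  int rank, size, token = 0;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  if (rank > 0)  /* wait until the previous rank has finished its spawn */
    MPI_Recv(&token, 1, MPI_INT, rank - 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);

  MPI_Comm child;
  MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1, MPI_INFO_NULL, 0,
                 MPI_COMM_SELF, &child, MPI_ERRCODES_IGNORE);
  MPI_Comm_disconnect(&child);

  if (rank < size - 1)  /* hand the token to the next rank */
    MPI_Send(&token, 1, MPI_INT, rank + 1, 0, MPI_COMM_WORLD);

  MPI_Barrier(MPI_COMM_WORLD);
}

int main(int argc, char *argv[])
{
  MPI_Init(&argc, &argv);

  MPI_Comm parent;
  MPI_Comm_get_parent(&parent);

  if (MPI_COMM_NULL == parent)
    serialized_self_spawn(argv[0]);  /* parent side: take turns spawning ourselves */
  else
    MPI_Comm_disconnect(&parent);    /* child side: just disconnect and exit */

  MPI_Finalize();
  return 0;
}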

bosilca commented 9 months ago

This storyline (nobody uses this feature) is getting old. I have had similar contacts as @dalcinl: people tried to use it, but it was broken, so they found another way.

I have no idea what low-stress and high-stress testing could be. It works or it doesn't.

rhc54 commented 9 months ago

Sigh - seems a rather pointless debate, doesn't it? Fact is, nobody in the OMPI community has historically been inclined to spend time worrying about it, so the debate is rather moot.

Kudos to @hppritcha for trying to take it on, or at least portions of it (e.g., the singleton comm-spawn case).

bosilca commented 9 months ago

Looking at the last two years of updates in the DPM-related code, many of us (you/ICM/LANL/Amazon/UTK) tried to do so. Smaller steps, but it got to a point where it kind of works. The only thing left is a solution that makes it work everywhere, because this is a critical feature outside the HPC market.

rhc54 commented 9 months ago

Agreed - my "low stress" was simply a single invocation of "comm_spawn" by a process in a job. "High-stress" is when a process in a job sits there and calls "comm_spawn" in a loop. That fills the system with lots of coupled jobs, and requires that the system (both MPI and RTE) be able to fully clean up/recover between jobs, etc.

We have historically been satisfied with making the "low stress" operation work. Occasionally, I'd take a crack at the "loop spawn" test and at least keep it from hanging, but it was always a struggle and didn't last for very long. And my "loop spawn" test was very simple, just looping over spawn and disconnecting. Never actually had the conjoined jobs do anything.

I agree that it is an important feature outside HPC, and it obviously should work. Perhaps my new approach will succeed where we previously failed - have to wait and see.