open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Hang during MPI_Init #11566

Closed: lrbison closed this issue 1 year ago

lrbison commented 1 year ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

main branch

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From source: ./configure --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --enable-debug --prefix=/home/ec2-user/ompi

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

Before and after https://github.com/open-mpi/ompi/pull/11563

I'm looking at the stack traces and hangs from after that PR, but I suspect it's the same issue that was present before the PR, just happening less often for some reason.

Please describe the system on which you are running


Details of the problem

Hang during startup of a very simple allgatherv test (ibm/collective/allgatherv). It seems to be more common with larger runs. The test case is 512 ranks across 8 hosts.

By inserting prints at the front, I can see that the allgatherv program is launched and can print "hello", but sometimes it then hangs forever during MPI_Init().
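
The prints follow this basic pattern (a minimal sketch of the diagnostic, not the actual ibm/collective/allgatherv source):

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv)
{
    /* Print and flush before MPI_Init to confirm the process actually launched. */
    printf("hello (before MPI_Init)\n");
    fflush(stdout);

    MPI_Init(&argc, &argv);    /* the hang is observed inside this call */

    printf("past MPI_Init\n");
    fflush(stdout);

    MPI_Finalize();
    return 0;
}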

Stack trace:

#0  0x0000ffff7ee07f7c in nanosleep () from /lib64/libc.so.6
#1  0x0000ffff7ee2fd40 in usleep () from /lib64/libc.so.6
#2  0x0000ffff7efecb10 in ompi_mpi_instance_init_common (argc=1, argv=0xfffff5404768) at instance/instance.c:739
#3  0x0000ffff7efece9c in ompi_mpi_instance_init (ts_level=0, info=0xffff7f4891e8 <ompi_mpi_info_null>, errhandler=0xffff7f481228 <ompi_mpi_errors_are_fatal>,
    instance=0xffff7f491cc8 <ompi_mpi_instance_default>, argc=1, argv=0xfffff5404768) at instance/instance.c:814
#4  0x0000ffff7efdb1b0 in ompi_mpi_init (argc=1, argv=0xfffff5404768, requested=0, provided=0xfffff54045ac, reinit_ok=false) at runtime/ompi_mpi_init.c:359
#5  0x0000ffff7f0512a4 in PMPI_Init (argc=0xfffff54045dc, argv=0xfffff54045d0) at init.c:67

Launch command:

$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv

rhc54 commented 1 year ago

Another thing that might help track this down: it could be helpful to strip this down to the bare minimum. Use an application that consists of just PMIx_Init, PMIx_Fence, and PMIx_Finalize (a minimal sketch of such an app follows below). It will run much faster than the MPI version and removes all the MPI code overhead from the equation. If you can cycle that application without hitting problems, then the issue is something in the interaction with OMPI and not just something inside PMIx.

That wouldn't rule out PMIx having a problem with the shmem rendezvous, since the address spaces might collide, but it would rule out a problem in the PMIx code itself.
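
A minimal app along those lines might look roughly like this (an untested sketch using the standard PMIx client calls; build against pmix.h and link with -lpmix):

#include <stdio.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        return 1;
    }

    /* Fence across all procs in our namespace (a NULL proc array means everyone). */
    rc = PMIx_Fence(NULL, 0, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "%s.%u: PMIx_Fence failed: %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(rc));
    }

    rc = PMIx_Finalize(NULL, 0);
    return (PMIX_SUCCESS == rc) ? 0 : 1;
}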

rhc54 commented 1 year ago

Okay, I went ahead and ran a PMIx_Init, PMIx_Fence, and PMIx_Finalize app on a continuous cycle using a single VM, heavily oversubscribed to ensure things were loaded. Without the gds/shmem component, it ran forever without a problem. However, with the gds/shmem component active, it hung after about 530 iterations.

Biggest difference: I did not see any lost process. All procs were present. I checked prterun and it was sitting patiently waiting for events. I also checked a couple of procs and found them in PMIx_Fence waiting for release. I didn't check them all to see if one was hung in PMIx_Init.

FWIW: my cmd line was prterun -n 44 --map-by :oversubscribe simple where simple was the trivial app described above. This VM is pretty weak, so 44 procs was a significant burden on it.

@samuelkgutierrez Perhaps that formula will help you reproduce it?

samuelkgutierrez commented 1 year ago

Thank you, @rhc54; this is certainly helpful. I'm debugging as we speak.

samuelkgutierrez commented 1 year ago

Hi, @wckzhang. Can you please do me a favor? I've found the likely cause of the hangs you're seeing with shmem. I've updated the way we do shared-memory segment reference counting, and the fix is in OpenPMIx master (https://github.com/openpmix/openpmix/pull/3051).
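
For context, cross-process reference counting on a shared-memory segment generally looks something like the sketch below. This is only a generic illustration of the pattern, not the OpenPMIx code; the segment name and layout are made up, and it ignores the races around initial creation and teardown that real code has to handle.

#include <fcntl.h>
#include <stdatomic.h>
#include <sys/mman.h>
#include <unistd.h>

#define SEG_NAME "/demo_seg"   /* hypothetical name, not what PMIx uses */
#define SEG_SIZE 4096

typedef struct {
    atomic_int refcount;       /* lives inside the segment, shared by all attachers */
} seg_header_t;

static seg_header_t *attach_segment(void)
{
    int fd = shm_open(SEG_NAME, O_CREAT | O_RDWR, 0600);
    if (fd < 0) {
        return NULL;
    }
    /* A freshly created segment is zero-filled, so refcount starts at 0. */
    if (ftruncate(fd, SEG_SIZE) != 0) {
        close(fd);
        return NULL;
    }
    seg_header_t *hdr = mmap(NULL, SEG_SIZE, PROT_READ | PROT_WRITE,
                             MAP_SHARED, fd, 0);
    close(fd);
    if (MAP_FAILED == hdr) {
        return NULL;
    }
    atomic_fetch_add(&hdr->refcount, 1);   /* count this attacher */
    return hdr;
}

static void detach_segment(seg_header_t *hdr)
{
    /* The last detacher unlinks the segment; getting this accounting wrong
     * can leave processes waiting on a segment that never becomes usable. */
    if (atomic_fetch_sub(&hdr->refcount, 1) == 1) {
        shm_unlink(SEG_NAME);
    }
    munmap(hdr, SEG_SIZE);
}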

When you have a moment, could you please retest using both OpenPMIx and PRRTE master? The only caveat is that you'll need to unignore the shmem component before running autogen.pl by adding your user moniker to src/mca/gds/shmem/.pmix_unignore.

Thank you and please let me know if you have any questions.

samuelkgutierrez commented 1 year ago

@wckzhang and I spoke offline, and he was kind enough to give me access to his environment. I can verify that after 200 test runs I did not see any hangs; without the fix in place, the program hung after about 25 iterations. @rhc54, it looks like https://github.com/openpmix/openpmix/pull/3051 fixes this issue.

wckzhang commented 1 year ago

Fixed by https://github.com/openpmix/openpmix/pull/3051. Closing.