Closed: lrbison closed this issue 1 year ago.
Another thing that might help track this down: could be helpful to strip this down to the bare minimum. Use an application that consists of PMIx_Init, PMIx_Fence, and PMIx_Finalize. This will run much faster than the MPI version and remove all the MPI code overhead from the equation. If you can cycle that application without hitting problems, then the issue is something to do with the interaction with OMPI and not just something inside of PMIx.
Doesn't rule out that PMIx has a problem with the shmem rendezvous as the address spaces might collide. But it would rule out a problem in the PMIx code itself.
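For reference, a minimal sketch of such a stripped-down test program might look like the following, assuming the standard PMIx client API from pmix.h; it would need to be compiled and linked against the installed libpmix:

#include <stdio.h>
#include <stdlib.h>
#include <pmix.h>

int main(void)
{
    pmix_proc_t myproc;
    pmix_status_t rc;

    /* connect to the local PMIx server (e.g., the prterun daemon) */
    rc = PMIx_Init(&myproc, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Init failed: %s\n", PMIx_Error_string(rc));
        exit(1);
    }

    /* fence across all procs in our namespace (NULL proc list = everyone) */
    rc = PMIx_Fence(NULL, 0, NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Fence failed: %s\n", PMIx_Error_string(rc));
    }

    /* disconnect from the PMIx server */
    rc = PMIx_Finalize(NULL, 0);
    if (PMIX_SUCCESS != rc) {
        fprintf(stderr, "PMIx_Finalize failed: %s\n", PMIx_Error_string(rc));
    }
    return 0;
}

Cycling a program like this under prterun exercises PMIx startup, the fence, and teardown with no MPI code in the picture.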
Okay, I went ahead and ran a PMIx_Init, PMIx_Fence, and PMIx_Finalize app on a continuous cycle using a single VM, heavily oversubscribed to ensure things were loaded. Without the gds/shmem component, it ran forever without a problem. However, with the gds/shmem component active, it hung after about 530 iterations.
Biggest difference: I did not see any lost process. All procs were present. I checked prterun and it was sitting patiently waiting for events. I also checked a couple of procs and found them in PMIx_Fence waiting for release. I didn't check them all to see if one was hung in PMIx_Init.
FWIW: my cmd line was prterun -n 44 --map-by :oversubscribe simple, where simple was the trivial app described above. This VM is pretty weak, so 44 procs was a significant burden on it.
@samuelkgutierrez Perhaps that formula will help you reproduce it?
Thank you, @rhc54; this is certainly helpful. I'm debugging as we speak.
Hi, @wckzhang. Can you please do me a favor? I've found the likely cause of your hangs using shmem. I've updated the way we do shared-memory segment reference counting and the fix is in OpenPMIx master (https://github.com/openpmix/openpmix/pull/3051).
When you have a moment, could you please retest using both OpenPMIx and PRRTE master? The only thing is that you'll have to unignore the shmem component before running autogen.pl by adding your user moniker to src/mca/gds/shmem/.pmix_unignore.
Thank you and please let me know if you have any questions.
@wckzhang and I spoke offline. He was kind enough to give me access to his environment. I can verify that after 200 test runs I did not see any hangs. Without the fix in place, the program hung after about 25 iterations. @rhc54 looks like https://github.com/openpmix/openpmix/pull/3051 fixes this issue.
Fixed by https://github.com/openpmix/openpmix/pull/3051. Closing.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
main branch
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From source:
./configure --with-sge --without-verbs --with-libfabric=/opt/amazon/efa --disable-man-pages --enable-debug --prefix=/home/ec2-user/ompi
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
Before and after https://github.com/open-mpi/ompi/pull/11563
I'm looking at the stack traces and hangs from after the PR, but I suspect it's the same issue as before the PR, just happening less often for some reason.
Please describe the system on which you are running
Details of the problem
Hang during startup of a very simple all-gather test (ibm/collective/allgatherv). Seems to be more common with larger runs. Test case is 512 ranks on 8 hosts.
By inserting prints at the front, I can see the allgatherv program is launched and can print "hello", but sometimes it hangs forever during MPI_Init().
Stack trace:
Launch command:
$PREFIX/bin/mpirun --prefix=$PREFIX -np 512 -N 64 -hostfile /home/ec2-user/PortaFiducia/hostfile /home/ec2-user/ompi-tests/ibm/collective/allgatherv