Closed: abouteiller closed this issue 1 year ago.
Hmmm...it is working fine for me on my Mac (single node) with current HEAD of both PMIx and PRRTE master branches:
$ prterun -n 4 --personality ompi --enable-recovery --with-ft ulfm ./fault
Client ns prterun-Ralphs-iMac-2-83504@1 rank 3: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 2: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 0: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 1: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 0: exiting with error
CLIENT prterun-Ralphs-iMac-2-83504@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-Ralphs-iMac-2-83504@1:0 EXIT STATUS 1
Client ns prterun-Ralphs-iMac-2-83504@1 rank 3: Finalizing
CLIENT prterun-Ralphs-iMac-2-83504@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-Ralphs-iMac-2-83504@1:0 EXIT STATUS 1
Client ns prterun-Ralphs-iMac-2-83504@1 rank 2: Finalizing
CLIENT prterun-Ralphs-iMac-2-83504@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-Ralphs-iMac-2-83504@1:0 EXIT STATUS 1
Client ns prterun-Ralphs-iMac-2-83504@1 rank 1: Finalizing
Client ns prterun-Ralphs-iMac-2-83504@1 rank 3:PMIx_Finalize successfully completed
Client ns prterun-Ralphs-iMac-2-83504@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-Ralphs-iMac-2-83504@1 rank 1:PMIx_Finalize successfully completed
$
I suspect this is a case of the OMPI submodule pointers being stale. Can you try it with updated submodules? I'll also try it with multiple nodes in the morning.
Well, working with HEAD of both master branches again, it works fine in a multi-node scenario:
$ prterun --personality ompi --enable-recovery --with-ft ulfm --map-by ppr:2:node ./fault
Client ns prterun-rhc-node01-52907@1 rank 1: Running
Client ns prterun-rhc-node01-52907@1 rank 0: Running
Client ns prterun-rhc-node01-52907@1 rank 5: Running
Client ns prterun-rhc-node01-52907@1 rank 4: Running
Client ns prterun-rhc-node01-52907@1 rank 3: Running
Client ns prterun-rhc-node01-52907@1 rank 2: Running
Client ns prterun-rhc-node01-52907@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-52907@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-52907@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 5: Finalizing
CLIENT prterun-rhc-node01-52907@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-52907@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 4: Finalizing
Client ns prterun-rhc-node01-52907@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52907@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52907@1 rank 5:PMIx_Finalize successfully completed
CLIENT prterun-rhc-node01-52907@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 3: Finalizing
Client ns prterun-rhc-node01-52907@1 rank 4:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52907@1 rank 3:PMIx_Finalize successfully completed
$
Tried excluding the HNP from the job in case you were running with mpiexec on a login node:
$ prterun --personality ompi --enable-recovery --with-ft ulfm --map-by ppr:2:node:nolocal ./fault
Client ns prterun-rhc-node01-52917@1 rank 1: Running
Client ns prterun-rhc-node01-52917@1 rank 3: Running
Client ns prterun-rhc-node01-52917@1 rank 0: Running
Client ns prterun-rhc-node01-52917@1 rank 2: Running
Client ns prterun-rhc-node01-52917@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-52917@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52917@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52917@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-52917@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52917@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52917@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-52917@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52917@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52917@1 rank 3: Finalizing
Client ns prterun-rhc-node01-52917@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52917@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52917@1 rank 3:PMIx_Finalize successfully completed
$
So it appears this truly is just a stale submodule pointer issue, yet again.
I am seeing the same issue.
I'm sure you are - but again, it isn't a problem in the code. It's just that your submodule pointers are stale. I just verified that all is okay using the HEAD of the PMIx and PRRTE release branches:
$ prterun --personality ompi --enable-recovery --with-ft ulfm --map-by ppr:2:node:nolocal ./fault
Client ns prterun-rhc-node01-10714@1 rank 2: Running
Client ns prterun-rhc-node01-10714@1 rank 3: Running
Client ns prterun-rhc-node01-10714@1 rank 0: Running
Client ns prterun-rhc-node01-10714@1 rank 1: Running
Client ns prterun-rhc-node01-10714@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-10714@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-10714@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-10714@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-10714@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-10714@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-10714@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-10714@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-10714@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-10714@1 rank 3: Finalizing
Client ns prterun-rhc-node01-10714@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-10714@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-10714@1 rank 3:PMIx_Finalize successfully completed
$
So please update your submodule pointers!
Closing as this appears to be a stale submodule pointer issue. Please reopen if seen after updates.
This is not a stale submodule pointer issue. The numbers listed in the original ticket are the submodule hashes as pulled from ompi v5.0.x; they are 4 weeks old.
4262efb8 (HEAD) Fix oac_check_package.m4 Ralph Castain 4 weeks ago
bd7d6a15 build: fix bashisms in configure Sam James 4 weeks ago
5644a70a (tag: v3.0.0) Update release date Ralph Castain 5 weeks ago
Replication requires 6 procs on 2 nodes (minimum). On the prted that hosted the failed app proc, the counter num_terminated remains at 2 (meanwhile num_local_procs is 3, as expected).
From what I tracked, the problem appears to be that only the IOF_COMPLETE state is reached and WAITPID_FIRED is never triggered, so the jdata->num_terminated++ tied to the first transition to PROC_STATE_TERMINATED (state_prted.c:477) never happens. Looks like the root cause is that WAITPID_FIRED doesn't kick in.
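To make the accounting concrete, here is a minimal, self-contained sketch of such a two-flag termination gate; the struct and helper names are illustrative, not PRRTE's actual symbols. If the waitpid event never arrives for the failed proc, the terminated count sticks at 2 of 3 and the daemon never concludes it can exit:

/* Illustrative sketch only -- simplified model of a daemon's termination
 * accounting; names are hypothetical, not PRRTE's. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool iof_complete;   /* all I/O channels for the proc were closed */
    bool waitpid_fired;  /* waitpid() reported the proc's exit status */
    bool recorded;       /* already counted in num_terminated */
} proc_t;

static int num_local_procs = 3;
static int num_terminated  = 0;

/* A proc is only counted as terminated once BOTH events have occurred. */
static void maybe_record_termination(proc_t *p)
{
    if (p->iof_complete && p->waitpid_fired && !p->recorded) {
        p->recorded = true;
        num_terminated++;
    }
    if (num_terminated == num_local_procs) {
        printf("all local procs accounted for: daemon may leave its main loop\n");
    }
}

int main(void)
{
    proc_t procs[3] = {{0}};

    /* ranks that terminate normally get both events */
    for (int i = 1; i < 3; i++) {
        procs[i].iof_complete  = true;
        procs[i].waitpid_fired = true;
        maybe_record_termination(&procs[i]);
    }

    /* the failed rank only ever reaches IOF_COMPLETE in the buggy path */
    procs[0].iof_complete = true;
    maybe_record_termination(&procs[0]);

    /* prints num_terminated=2 of 3: the daemon waits forever */
    printf("num_terminated=%d of %d\n", num_terminated, num_local_procs);
    return 0;
}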
Did you update the submodule pointers to HEAD of the respective master branches?? If not, then this might well be a stale pointer problem. Things do change rather rapidly at times, you know, and I am getting increasingly resistant to chasing anything from OMPI unless you update the pointers.
Here is what I get with HEAD of PMIx and PRRTE master branches:
$ prterun --map-by ppr:3:node --personality ompi --enable-recovery --with-ft ulfm ./fault
Client ns prterun-rhc-node01-62534@1 rank 1: Running
Client ns prterun-rhc-node01-62534@1 rank 0: Running
Client ns prterun-rhc-node01-62534@1 rank 2: Running
Client ns prterun-rhc-node01-62534@1 rank 4: Running
Client ns prterun-rhc-node01-62534@1 rank 3: Running
Client ns prterun-rhc-node01-62534@1 rank 5: Running
Client ns prterun-rhc-node01-62534@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-62534@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-62534@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-62534@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 5: Finalizing
CLIENT prterun-rhc-node01-62534@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 4: Finalizing
CLIENT prterun-rhc-node01-62534@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 3: Finalizing
Client ns prterun-rhc-node01-62534@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 3:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 5:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 4:PMIx_Finalize successfully completed
$
Since I could interpret your note to mean 6 procs/node on 2 nodes, I upped the ppr to 6:
$ prterun --map-by ppr:6:node --personality ompi --enable-recovery --with-ft ulfm ./fault
Client ns prterun-rhc-node01-62545@1 rank 0: Running
Client ns prterun-rhc-node01-62545@1 rank 4: Running
Client ns prterun-rhc-node01-62545@1 rank 5: Running
Client ns prterun-rhc-node01-62545@1 rank 3: Running
Client ns prterun-rhc-node01-62545@1 rank 1: Running
Client ns prterun-rhc-node01-62545@1 rank 2: Running
Client ns prterun-rhc-node01-62545@1 rank 9: Running
Client ns prterun-rhc-node01-62545@1 rank 8: Running
Client ns prterun-rhc-node01-62545@1 rank 10: Running
Client ns prterun-rhc-node01-62545@1 rank 6: Running
Client ns prterun-rhc-node01-62545@1 rank 11: Running
Client ns prterun-rhc-node01-62545@1 rank 7: Running
Client ns prterun-rhc-node01-62545@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-62545@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 5: Finalizing
CLIENT prterun-rhc-node01-62545@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 3: Finalizing
CLIENT prterun-rhc-node01-62545@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-62545@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 4: Finalizing
CLIENT prterun-rhc-node01-62545@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-62545@1:9 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 9: Finalizing
CLIENT prterun-rhc-node01-62545@1:10 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 10: Finalizing
CLIENT prterun-rhc-node01-62545@1:11 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 11: Finalizing
CLIENT prterun-rhc-node01-62545@1:8 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 8: Finalizing
CLIENT prterun-rhc-node01-62545@1:6 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 6: Finalizing
Client ns prterun-rhc-node01-62545@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 4:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 3:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 5:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 6:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 8:PMIx_Finalize successfully completed
CLIENT prterun-rhc-node01-62545@1:7 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 7: Finalizing
Client ns prterun-rhc-node01-62545@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 10:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 9:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 11:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 7:PMIx_Finalize successfully completed
$
So once again, I am not seeing this problem on HEAD of the respective master branches. Since I just updated the release branches yesterday, and we updated the submodule pointers on both OMPI main and v5 branches, I suggest you update and recheck your use-case.
Thanks for taking a look. I can still replicate with the newest masters of both OpenPMIx and PRRTE. I'm looking into it. Could you produce a log with state_base_verbose and --debug-daemons (first run, 3 procs/node, 2 nodes)? I'd like to compare the difference between when it works and when it doesn't.
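For reference, a command along these lines should capture that trace; the option spelling here is an assumption based on current PRRTE (adjust for your build):
$ prterun --map-by ppr:3:node --personality ompi --enable-recovery --with-ft ulfm --prtemca state_base_verbose 5 --debug-daemons ./fault 2>&1 | tee fault-state.log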
Interesting - could be a race condition that I win and your system loses. I'll generate the log this morning.
Here's the log: https://gist.github.com/rhc54/0f32f5628f35dd18d4049f838c01bcc0
Any update here?
@abouteiller Any update on this issue?
Hi everyone, I'm also seeing this on OpenMPI, on both the main and 5.x branches.
Here are the versions:
main d199429beb Merge pull request #11449 from jsquyres/pr/fix-romio431-mpl-configure-ac271-issues
22dabb8e471fb110ca75a5fd3d5f12d8d6a984a3 3rd-party/openpmix (v1.1.3-3791-g22dabb8e)
10496e38a0b54722723ec83923f6311ec82d692b 3rd-party/prrte (psrvr-v2.0.0rc1-4582-g10496e38a0)
237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)
v5.0.x 64715a76be Merge pull request #11440 from awlauria/v5.0.x_updated2
7f6f8db13b42916b27b690b8a3f9e2757ec1417f 3rd-party/openpmix (v4.2.3-8-g7f6f8db1)
c7b2c715f92495637c298249deb5493e86864ac8 3rd-party/prrte (v3.0.1rc1-36-gc7b2c715f9)
237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)
In my testing, it fails even with 2 processes on different nodes (i.e., sbatch --ntasks-per-node=1 -n 2 run.sh), but works with any number of processes on a single node (i.e., sbatch --ntasks-per-node=8 -n 8 run.sh).
I'm sorry to stress this @rhc54, but are you sure you're testing with at least 2 separate nodes? From your output it seems you're using a single node.
Little hard to miss one vs multiple nodes 😄 Doesn't appear that the maintainers of that feature are working on this issue, and it has nothing to do with me, I'm afraid.
I'm sorry, the outputs confused me, thanks for answering.
I'm working on it but it has escaped me so far. I'll give it another stab.
Interim progress report:
I have found the location that causes the problem (we were not switching the proc state to WAITPID_FIRED in some cases before removing the waitpid callback).
Now that this is fixed, the whole job gets cleaned up from the mpiexec process as soon as one daemon reports that all of its local procs are gone, so I'll fix that next.
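A hedged sketch of the first part of that change, with hypothetical names (this is the idea only, not the actual patch): record the waitpid event before dropping the callback for a proc that is already known to be dead, so the termination accounting still advances.

/* Sketch of the idea only; names are hypothetical, not the actual PRRTE code. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    bool iof_complete;
    bool waitpid_fired;
} proc_t;

/* Stand-in for activating the PROC_STATE_TERMINATED transition, which is
 * where the terminated counter gets incremented. */
static void activate_terminated_state(proc_t *p)
{
    (void)p;
    printf("proc moves to TERMINATED and is counted as done\n");
}

/* Called when the daemon stops tracking waitpid for a proc that is already
 * known to be dead: mark the waitpid event as fired before the callback is
 * removed, so the proc is still counted. */
static void drop_waitpid_tracking(proc_t *p)
{
    if (!p->waitpid_fired) {
        p->waitpid_fired = true;
        if (p->iof_complete) {
            activate_terminated_state(p);
        }
    }
    /* ...then cancel/remove the waitpid callback as before... */
}

int main(void)
{
    proc_t failed = { .iof_complete = true, .waitpid_fired = false };
    drop_waitpid_tracking(&failed);
    return 0;
}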
The reason I wasn't able to reproduce is that you folks forgot to mention that mpiexec was running on a node that was not included in the allocation (i.e., a login node). I can reproduce it if mpiexec isn't running any application processes.
@abouteiller I know you are busy. If you can tell me what you fixed so far, I can help take it the rest of the way.
This simple fix (#1695) resolved the problem for me. I even delayed the finalizing of one of the other procs to verify that the job wasn't getting cleaned up early and all looked good.
Ok, this is very similar to what I had on my side; I was just more selective about when to set WAITPID_FIRED. I'll double check, but I expect that if all processes hosted by a particular daemon terminate in error, that will trigger a whole-job cleanup.
I specifically tested that scenario and I'm not seeing a problem - everything just waits until the remaining procs terminate before cleaning up.
The problem listed in the original report is now fixed; let's keep the discussion rolling in #1698.
Background information
When testing the functionality for applications that can react to fault events (e.g., MPI ULFM), the application deadlocks in MPI_Finalize and/or mpiexec doesn't return, because prted processes remain in the main loop after all processes have finalized (clean quit, or failed).
What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)
Please describe the system on which you are running
Details of the problem
I can replicate the problem outside of MPI programs using the fault.c example from PRRTE. One note is that because the CLI_RECOVERABLE and CLI_NOTIFYERRORS options are not accessible from the command line, I have to use mpiexec (the schizo_ompi component does add these options) rather than prterun, but that should have no bearing on the experiments otherwise.
At this point, all fault processes quit normally (they are gone, as observed with ps on the nodes), but the prted remain active on the nodes and thus mpiexec is deadlocked.
In a full-fledged MPI application, this in turn translates into a different consequence: the application processes are deadlocked waiting on the end of the PMIx_Fence that appears at the end of MPI_Finalize.
Overall that sounds like a counting problem for the number of active processes; that is, processes reported as aborted or term_wo_sync will not partake in the PMIx_Fence and should not prevent prted from exiting when no application process is active.
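Put differently, a hedged sketch of the expected accounting (field names are illustrative, not PRRTE's): an aborted proc should count toward the daemon's local completion just like a cleanly finalized one, and must not be treated as a pending fence participant.

/* Sketch of the expected accounting only; names are illustrative. */
#include <stdbool.h>
#include <stdio.h>

typedef struct {
    int num_local_procs;
    int num_finalized;   /* procs that completed PMIx_Finalize cleanly */
    int num_aborted;     /* procs that exited nonzero or without sync */
} daemon_t;

/* The daemon should leave its main loop once every local proc is accounted
 * for, whether it finalized cleanly or aborted. */
static bool daemon_may_exit(const daemon_t *d)
{
    return d->num_finalized + d->num_aborted == d->num_local_procs;
}

int main(void)
{
    daemon_t d = { .num_local_procs = 3, .num_finalized = 2, .num_aborted = 1 };
    printf("daemon may exit: %s\n", daemon_may_exit(&d) ? "yes" : "no");
    return 0;
}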