openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

Prrte example 'fault' causes 'prted' to deadlock after finalize #1607

Closed · abouteiller closed this 1 year ago

abouteiller commented 1 year ago

Background information

When testing the functionality for applications that can react to fault events (e.g., MPI ULFM), the application deadlocks in MPI_Finalize and/or mpiexec doesn't return, because prted processes remain in their main loop after all application processes have finalized (whether they exited cleanly or failed).

What version of the PMIx Reference RTE (PRRTE) are you using? (e.g., v2.0, v3.0, git master @ hash, etc.)

git submodule
 e770a3362fcea61778b85b4c7cfb7044443c9490 3rd-party/openpmix (v4.2.2-9-ge770a336)
 4262efb88da9292fbf5a29665dcc98eed49c751c 3rd-party/prrte (v3.0.0-2-g4262efb8)

Please describe the system on which you are running


Details of the problem

I can replicate the problem outside of MPI programs using the fault.c example from PRRTE. One note: because the CLI_RECOVERABLE and CLI_NOTIFYERRORS options are not accessible from the command line, I have to use mpiexec (whose schizo_ompi component adds these options) rather than prterun, but that should otherwise have no bearing on the experiments.
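
For reference, the fault example is essentially a PMIx client in which rank 0 exits with a nonzero status while the other ranks register for, and react to, the resulting termination event. A minimal sketch of that shape, assuming the standard PMIx v4 client API (illustrative only, not the actual fault.c source):

    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <pmix.h>

    static pmix_proc_t myproc;

    /* called when the runtime notifies us that a peer terminated abnormally */
    static void notification_fn(size_t evhdlr_registration_id, pmix_status_t status,
                                const pmix_proc_t *source,
                                pmix_info_t info[], size_t ninfo,
                                pmix_info_t results[], size_t nresults,
                                pmix_event_notification_cbfunc_fn_t cbfunc, void *cbdata)
    {
        fprintf(stderr, "CLIENT %s:%u NOTIFIED STATUS %s\n",
                myproc.nspace, myproc.rank, PMIx_Error_string(status));
        if (NULL != cbfunc) {
            /* tell the runtime we handled the event; it need not terminate us */
            cbfunc(PMIX_EVENT_ACTION_COMPLETE, NULL, 0, NULL, NULL, cbdata);
        }
    }

    /* registration result; a real program would synchronize on this */
    static void reg_callbk(pmix_status_t status, size_t evhandler_ref, void *cbdata)
    {
    }

    int main(int argc, char **argv)
    {
        pmix_status_t code = PMIX_ERR_PROC_ABORTED;

        if (PMIX_SUCCESS != PMIx_Init(&myproc, NULL, 0)) {
            exit(1);
        }
        PMIx_Register_event_handler(&code, 1, NULL, 0,
                                    notification_fn, reg_callbk, NULL);
        if (0 == myproc.rank) {
            exit(1);   /* simulate the fault */
        }
        sleep(2);      /* give the notification time to arrive */
        PMIx_Finalize(NULL, 0);
        return 0;
    }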

salloc -N 2 ~/ompi/v5.0.x/build/bin/mpiexec -N 3 --with-ft ulfm fault
salloc: Granted job allocation 458763
salloc: Waiting for resource configuration
salloc: Nodes c[00-01] are ready for job
Client ns prterun-saturn-21145@1 rank 2: Running
Client ns prterun-saturn-21145@1 rank 0: Running
Client ns prterun-saturn-21145@1 rank 1: Running
Client ns prterun-saturn-21145@1 rank 5: Running
Client ns prterun-saturn-21145@1 rank 3: Running
Client ns prterun-saturn-21145@1 rank 4: Running
Client ns prterun-saturn-21145@1 rank 0: exiting with error
CLIENT prterun-saturn-21145@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-saturn-21145@1:0 EXIT STATUS 1
Client ns prterun-saturn-21145@1 rank 2: Finalizing
CLIENT prterun-saturn-21145@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-saturn-21145@1:0 EXIT STATUS 1
Client ns prterun-saturn-21145@1 rank 1: Finalizing
CLIENT prterun-saturn-21145@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-saturn-21145@1:0 EXIT STATUS 1
Client ns prterun-saturn-21145@1 rank 5: Finalizing
CLIENT prterun-saturn-21145@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-saturn-21145@1:0 EXIT STATUS 1
Client ns prterun-saturn-21145@1 rank 4: Finalizing
CLIENT prterun-saturn-21145@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-saturn-21145@1:0 EXIT STATUS 1
Client ns prterun-saturn-21145@1 rank 3: Finalizing
Client ns prterun-saturn-21145@1 rank 4:PMIx_Finalize successfully completed
Client ns prterun-saturn-21145@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-saturn-21145@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-saturn-21145@1 rank 5:PMIx_Finalize successfully completed
Client ns prterun-saturn-21145@1 rank 3:PMIx_Finalize successfully completed

At this point, all fault processes have quit normally (they are gone, as observed with ps on the nodes), but the prted processes remain active on the nodes and mpiexec therefore deadlocks.

In a full-fledged MPI application this manifests differently: the application processes deadlock waiting for the PMIx_Fence at the end of MPI_Finalize to complete.

Overall this sounds like a miscount of the number of active processes: processes reported as aborted or term_wo_sync will not take part in the PMIx_Fence and should not prevent prted from exiting once no application process remains active.
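
To make the expected accounting concrete, here is an illustrative sketch of that invariant (the names are mine, not PRRTE's; the real state machine is event-driven):

    /* illustrative only -- not the actual PRRTE structures or names */
    typedef struct {
        int num_local_procs;   /* procs this prted launched */
        int num_terminated;    /* procs that reached any terminal state */
    } local_job_track_t;

    /* every terminal state -- clean exit, abort, or termination without
     * sync -- must bump the counter exactly once */
    static void on_proc_terminal_state(local_job_track_t *trk)
    {
        trk->num_terminated++;
        if (trk->num_terminated == trk->num_local_procs) {
            /* no live local procs remain: the prted may leave its main loop */
        }
    }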

rhc54 commented 1 year ago

Hmmm...it is working fine for me on my Mac (single node) with current HEAD of both PMIx and PRRTE master branches:

$ prterun -n 4 --personality ompi --enable-recovery  --with-ft ulfm ./fault
Client ns prterun-Ralphs-iMac-2-83504@1 rank 3: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 2: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 0: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 1: Running
Client ns prterun-Ralphs-iMac-2-83504@1 rank 0: exiting with error
CLIENT prterun-Ralphs-iMac-2-83504@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-Ralphs-iMac-2-83504@1:0 EXIT STATUS 1
Client ns prterun-Ralphs-iMac-2-83504@1 rank 3: Finalizing
CLIENT prterun-Ralphs-iMac-2-83504@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-Ralphs-iMac-2-83504@1:0 EXIT STATUS 1
Client ns prterun-Ralphs-iMac-2-83504@1 rank 2: Finalizing
CLIENT prterun-Ralphs-iMac-2-83504@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-Ralphs-iMac-2-83504@1:0 EXIT STATUS 1
Client ns prterun-Ralphs-iMac-2-83504@1 rank 1: Finalizing
Client ns prterun-Ralphs-iMac-2-83504@1 rank 3:PMIx_Finalize successfully completed
Client ns prterun-Ralphs-iMac-2-83504@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-Ralphs-iMac-2-83504@1 rank 1:PMIx_Finalize successfully completed
$

I suspect this is a case of the OMPI submodule pointers being stale. Can you try it with updated submodules? I'll also try it with multiple nodes in the morning.

rhc54 commented 1 year ago

Well, working with the HEAD of both master branches again, it works fine in a multi-node scenario:

$ prterun --personality ompi --enable-recovery --with-ft ulfm --map-by ppr:2:node ./fault
Client ns prterun-rhc-node01-52907@1 rank 1: Running
Client ns prterun-rhc-node01-52907@1 rank 0: Running
Client ns prterun-rhc-node01-52907@1 rank 5: Running
Client ns prterun-rhc-node01-52907@1 rank 4: Running
Client ns prterun-rhc-node01-52907@1 rank 3: Running
Client ns prterun-rhc-node01-52907@1 rank 2: Running
Client ns prterun-rhc-node01-52907@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-52907@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-52907@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 5: Finalizing
CLIENT prterun-rhc-node01-52907@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-52907@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 4: Finalizing
Client ns prterun-rhc-node01-52907@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52907@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52907@1 rank 5:PMIx_Finalize successfully completed
CLIENT prterun-rhc-node01-52907@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52907@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52907@1 rank 3: Finalizing
Client ns prterun-rhc-node01-52907@1 rank 4:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52907@1 rank 3:PMIx_Finalize successfully completed
$

I also tried excluding the HNP from the job, in case you were running mpiexec on a login node:

$ prterun --personality ompi --enable-recovery --with-ft ulfm --map-by ppr:2:node:nolocal ./fault
Client ns prterun-rhc-node01-52917@1 rank 1: Running
Client ns prterun-rhc-node01-52917@1 rank 3: Running
Client ns prterun-rhc-node01-52917@1 rank 0: Running
Client ns prterun-rhc-node01-52917@1 rank 2: Running
Client ns prterun-rhc-node01-52917@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-52917@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52917@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52917@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-52917@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52917@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52917@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-52917@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-52917@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-52917@1 rank 3: Finalizing
Client ns prterun-rhc-node01-52917@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52917@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-52917@1 rank 3:PMIx_Finalize successfully completed
$

So it appears this truly is just a stale submodule pointer issue, yet again.

janciesko commented 1 year ago

I am seeing the same issue.

rhc54 commented 1 year ago

I'm sure you are - but again, it isn't a problem in the code. It's just that your submodule pointers are stale. I just verified that all is okay using the HEAD of the PMIx and PRRTE release branches:

$ prterun --personality ompi --enable-recovery --with-ft ulfm --map-by ppr:2:node:nolocal ./fault
Client ns prterun-rhc-node01-10714@1 rank 2: Running
Client ns prterun-rhc-node01-10714@1 rank 3: Running
Client ns prterun-rhc-node01-10714@1 rank 0: Running
Client ns prterun-rhc-node01-10714@1 rank 1: Running
Client ns prterun-rhc-node01-10714@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-10714@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-10714@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-10714@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-10714@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-10714@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-10714@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-10714@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-10714@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-10714@1 rank 3: Finalizing
Client ns prterun-rhc-node01-10714@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-10714@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-10714@1 rank 3:PMIx_Finalize successfully completed
$

So please update your submodule pointers!

rhc54 commented 1 year ago

Closing as this appears to be a stale submodule pointer issue. Please reopen if seen after updates.

abouteiller commented 1 year ago

This is not a stale submodule pointer issue. The hashes listed in the original ticket are the submodules as pulled from ompi v5.0.x; they are 4 weeks old.

4262efb8 (HEAD) Fix oac_check_package.m4      Ralph Castain  4 weeks ago
bd7d6a15 build: fix bashisms in configure                     Sam James  4 weeks ago
5644a70a (tag: v3.0.0) Update release date            Ralph Castain  5 weeks ago

Replication requires at least 6 procs on 2 nodes.

On the prted that hosted the failed app proc, the counter num_terminated remains at 2 (while num_local_procs is 3, as expected).

From what I tracked, the problem appears to be that only the IOF_COMPLETE state is reached; WAITPID_FIRED is never triggered:

  1. At errmgr_prted.c:528, we notify the HNP and then jump to cleanup, because we found IOF_COMPLETE and NONZERO_EXIT but not WAITPID_FIRED.
  2. The WAITPID_FIRED event never kicks in, so the proc never transitions to PROC_STATE_TERMINATED as it should (errmgr_prted.c:641).
  3. We were supposed to do jdata->num_terminated++ when transitioning to PROC_STATE_TERMINATED for the first time (state_prted.c:477).

It looks like the root cause is that WAITPID_FIRED never kicks in.
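
Paraphrasing the flow I traced (the file and state names follow the PRRTE source, but the snippet itself is a sketch, not the literal code):

    /* sketch of the suspected errmgr_prted.c flow -- not the literal code */
    if (iof_complete(child) && nonzero_exit(child) && !waitpid_fired(child)) {
        notify_hnp(child);   /* step 1: the HNP is told about the failure */
        goto cleanup;        /* ...and we bail out here */
    }
    /* steps 2 and 3: because WAITPID_FIRED never arrives, we never get back
     * here, so the proc never moves to PROC_STATE_TERMINATED and
     * jdata->num_terminated++ never happens */
    set_proc_state(child, PROC_STATE_TERMINATED);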

rhc54 commented 1 year ago

Did you update the submodule pointers to HEAD of the respective master branches?? If not, then this might well be a stale pointer problem. Things do change rather rapidly at times, you know, and I am getting increasingly resistant to chasing anything from OMPI unless you update the pointers.

rhc54 commented 1 year ago

Here is what I get with HEAD of PMIx and PRRTE master branches:

$ prterun --map-by ppr:3:node --personality ompi --enable-recovery  --with-ft ulfm ./fault
Client ns prterun-rhc-node01-62534@1 rank 1: Running
Client ns prterun-rhc-node01-62534@1 rank 0: Running
Client ns prterun-rhc-node01-62534@1 rank 2: Running
Client ns prterun-rhc-node01-62534@1 rank 4: Running
Client ns prterun-rhc-node01-62534@1 rank 3: Running
Client ns prterun-rhc-node01-62534@1 rank 5: Running
Client ns prterun-rhc-node01-62534@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-62534@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-62534@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-62534@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 5: Finalizing
CLIENT prterun-rhc-node01-62534@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 4: Finalizing
CLIENT prterun-rhc-node01-62534@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62534@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62534@1 rank 3: Finalizing
Client ns prterun-rhc-node01-62534@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 3:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 5:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62534@1 rank 4:PMIx_Finalize successfully completed
$

Since I could interpret your note to mean 6 procs/node on 2 nodes, I upped the ppr to 6:

$ prterun --map-by ppr:6:node --personality ompi --enable-recovery  --with-ft ulfm ./fault
Client ns prterun-rhc-node01-62545@1 rank 0: Running
Client ns prterun-rhc-node01-62545@1 rank 4: Running
Client ns prterun-rhc-node01-62545@1 rank 5: Running
Client ns prterun-rhc-node01-62545@1 rank 3: Running
Client ns prterun-rhc-node01-62545@1 rank 1: Running
Client ns prterun-rhc-node01-62545@1 rank 2: Running
Client ns prterun-rhc-node01-62545@1 rank 9: Running
Client ns prterun-rhc-node01-62545@1 rank 8: Running
Client ns prterun-rhc-node01-62545@1 rank 10: Running
Client ns prterun-rhc-node01-62545@1 rank 6: Running
Client ns prterun-rhc-node01-62545@1 rank 11: Running
Client ns prterun-rhc-node01-62545@1 rank 7: Running
Client ns prterun-rhc-node01-62545@1 rank 0: exiting with error
CLIENT prterun-rhc-node01-62545@1:5 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 5: Finalizing
CLIENT prterun-rhc-node01-62545@1:3 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 3: Finalizing
CLIENT prterun-rhc-node01-62545@1:1 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 1: Finalizing
CLIENT prterun-rhc-node01-62545@1:4 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 4: Finalizing
CLIENT prterun-rhc-node01-62545@1:2 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 2: Finalizing
CLIENT prterun-rhc-node01-62545@1:9 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 9: Finalizing
CLIENT prterun-rhc-node01-62545@1:10 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 10: Finalizing
CLIENT prterun-rhc-node01-62545@1:11 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 11: Finalizing
CLIENT prterun-rhc-node01-62545@1:8 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 8: Finalizing
CLIENT prterun-rhc-node01-62545@1:6 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 6: Finalizing
Client ns prterun-rhc-node01-62545@1 rank 1:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 4:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 3:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 5:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 6:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 8:PMIx_Finalize successfully completed
CLIENT prterun-rhc-node01-62545@1:7 NOTIFIED STATUS PROC-EXIT-NONZERO-TERM - AFFECTED prterun-rhc-node01-62545@1:0 EXIT STATUS 1
Client ns prterun-rhc-node01-62545@1 rank 7: Finalizing
Client ns prterun-rhc-node01-62545@1 rank 2:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 10:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 9:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 11:PMIx_Finalize successfully completed
Client ns prterun-rhc-node01-62545@1 rank 7:PMIx_Finalize successfully completed
$

So once again, I am not seeing this problem on HEAD of the respective master branches. Since I just updated the release branches yesterday, and we updated the submodule pointers on both OMPI main and v5 branches, I suggest you update and recheck your use-case.

abouteiller commented 1 year ago

Thanks for taking a look. I can still replicate with the newest masters of both OpenPMIx and PRRTE. I'm looking into it. Could you produce a log with state_base_verbose and --debug-daemons (first run, 3 procs/node, 2 nodes)? I'd like to compare the difference between when it works and when it doesn't.
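
(Something along these lines should produce such a log; the exact spelling of the verbosity knob may vary across versions:)

    prterun --map-by ppr:3:node --personality ompi --enable-recovery --with-ft ulfm \
            --prtemca state_base_verbose 10 --debug-daemons ./fault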

rhc54 commented 1 year ago

Interesting - could be a race condition that I win and your system loses. I'll generate the log this morning.

rhc54 commented 1 year ago

Here's the log: https://gist.github.com/rhc54/0f32f5628f35dd18d4049f838c01bcc0

rhc54 commented 1 year ago

Any update here?

naughtont3 commented 1 year ago

@abouteiller Any update on this issue?

rcoacci commented 1 year ago

Hi everyone, I'm also seeing this on OpenMPI, on both the main and 5.x branches. Here are the versions:

 main      d199429beb Merge pull request #11449 from jsquyres/pr/fix-romio431-mpl-configure-ac271-issues
 22dabb8e471fb110ca75a5fd3d5f12d8d6a984a3 3rd-party/openpmix (v1.1.3-3791-g22dabb8e)
 10496e38a0b54722723ec83923f6311ec82d692b 3rd-party/prrte (psrvr-v2.0.0rc1-4582-g10496e38a0)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)
 v5.0.x 64715a76be Merge pull request #11440 from awlauria/v5.0.x_updated2
 7f6f8db13b42916b27b690b8a3f9e2757ec1417f 3rd-party/openpmix (v4.2.3-8-g7f6f8db1)
 c7b2c715f92495637c298249deb5493e86864ac8 3rd-party/prrte (v3.0.1rc1-36-gc7b2c715f9)
 237ceff1a8ed996d855d69f372be9aaea44919ea config/oac (237ceff)

In my testing, it fails even with 2 processes on different nodes (i.e., sbatch --ntasks-per-node=1 -n 2 run.sh), but works with any number of processes on a single node (i.e., sbatch --ntasks-per-node=8 -n 8 run.sh).

I'm sorry to stress this @rhc54, but are you sure you're testing with at least 2 separate nodes? From your output it seems you're using a single node.

rhc54 commented 1 year ago

I'm sorry to stress this @rhc54, but are you sure you're testing with at least 2 separate nodes? From your output it seems you're using a single node.

Little hard to miss one vs multiple nodes 😄 Doesn't appear that the maintainers of that feature are working on this issue, and it has nothing to do with me, I'm afraid.

rcoacci commented 1 year ago

Little hard to miss one vs multiple nodes 😄 Doesn't appear that the maintainers of that feature are working on this issue, and it has nothing to do with me, I'm afraid.

I'm sorry, the outputs confused me, thanks for answering.

abouteiller commented 1 year ago

I'm working on it but it has escaped me so far. I'll give it another stab.

abouteiller commented 1 year ago

Interim progress report:

I have found the location that causes the problem: in some cases we were not switching the proc state to WAITPID_FIRED before removing the waitpid callback.

Now that this is fixed, a new problem surfaces: the whole job gets cleaned up by the mpiexec process as soon as one daemon reports that all of its local procs are gone. I'll fix that next.

rhc54 commented 1 year ago

The reason I wasn't able to reproduce is that you folks forgot to mention that mpiexec was running on a node that was not included in the allocation (i.e., a login node). I can reproduce it if mpiexec isn't running any application processes.

@abouteiller I know you are busy. If you can tell me what you fixed so far, I can help take it the rest of the way.

rhc54 commented 1 year ago

This simple fix (#1695) resolved the problem for me. I even delayed the finalizing of one of the other procs to verify that the job wasn't getting cleaned up early and all looked good.

abouteiller commented 1 year ago

Ok, this is very similar to what I had on my side; I was just more selective about when to set WAITPID_FIRED. I'll double-check, but I expect that if all processes hosted by a particular daemon terminate in error, that will trigger a whole-job cleanup.

rhc54 commented 1 year ago

I specifically tested that scenario and I'm not seeing a problem - everything just waits until the remaining procs terminate before cleaning up.

abouteiller commented 1 year ago

The problem listed in the original report is now fixed; let's keep the discussion rolling in #1698.