openpmix / prrte

PMIx Reference RunTime Environment (PRRTE)
https://pmix.org

Recoverable jobs that have recovered return JOB_ABORTED_BY_SIG #962

Closed abouteiller closed 1 year ago

abouteiller commented 3 years ago

Recoverable jobs emit an error message and report failed processes even after they have successfully recovered.

Background information

When using PRTE with FT (e.g., with Open MPI's mpiexec --with-ft ulfm), the following output is produced, and mpiexec completes with an error code:

######################################################################
### TESTING 5: ULFM error handler after failure, 45s sleep
/home/bouteill/ompi/master.debug//bin/mpiexec -np 4 --with-ft mpi ./err_handler
######################################################################

Rank 0003: committing suicide
Sleeping for 45s ... ... ...
## Timings ########### Min         ### Max         ##
Barrier (no fault)  #   3.96310e-05 #   4.76010e-05
Barrier (new fault) #   3.93043e-02 #   3.93648e-02
Barrier (old fault) #   7.59000e-06 #   9.42100e-06
        TEST PASSED
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 16362 on node saturn exited on signal 9 (Killed).
--------------------------------------------------------------------------

From the output, we can see that

  1. the job has been managed as a recoverable job (correct)
  2. the MPI layer has recovered, and proceeded to continue execution without interference from PRTE (correct)
  3. when the application finalizes/completes, an error is generated and sets the completion status of the job to an error code (not correct)

This happens because, at some point, we have marked jdata->state as PRTE_PROC_ABORTED_BY_SIG.

Potential solution

One option is, when the job is recoverable, to not mark the job itself as ABORTED_BY_SOMETHING and only mark the individual procs. The client (e.g., the MPI layer) should then take action to abort the job if needed, and that explicit abort is what should set the ABORTED_BY_SOMETHING flag in jdata->state for the job.

In the error case, there may be a technical difficulty in attributing the ABORTED_BY_SOMETHING to the appropriate SOMETHING: if a proc gets ABORTED_BY_SIGNAL, the job is recoverable, and the MPI layer then decides that it cannot recover, we may mis-identify the condition as the application having called PMIx_Abort().

What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)

00a065fb02e9ed25190ae0a5a0c476722fc38984 (HEAD -> fix/bmg, mine/fix/bmg) 
...
9d8188f0 (origin/master, origin/HEAD, master) Merge pull request #956 from rhc54/topic/fnc                             Ralph Castain   7 days ago
What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)
1191ca59 (HEAD) Merge pull request #2189 from rhc54/topic/dedup                                                        Ralph Castain   8 days ago

Please describe the system on which you are running


Details of the problem

To replicate the issue, one can use Open MPI and ompi-tests-public:

git submodule && autogen, etc. 
cd ompi-builddir
${ompi_srcdir}/configure && make install
git clone openmpi/ompi-tests-public
cd ompi-tests-public 
git submodule update --recursive 
cd ulfm-testing/api
salloc -N 2 ompi/master.debug//bin/mpiexec -N 2 -np 4 --with-ft mpi ./err_returns
### To run ALL FT tests:
#ULFM_PREFIX=${ompi_builddir} runtest.sh
abouteiller commented 1 year ago

Defect not present anymore