Recoverable jobs spit out an error message and complain about failed processes after they have been recovered.
Background information
When using PRTE with FT (e.g., with Open MPI mpiexec --with-ft ulfm), the following output is produced, and mpiexec completes with an error code:
######################################################################
### TESTING 5: ULFM error handler after failure, 45s sleep
/home/bouteill/ompi/master.debug//bin/mpiexec -np 4 --with-ft mpi ./err_handler
######################################################################
Rank 0003: committing suicide
Sleeping for 45s ... ... ...
## Timings ########### Min ### Max ##
Barrier (no fault) # 3.96310e-05 # 4.76010e-05
Barrier (new fault) # 3.93043e-02 # 3.93648e-02
Barrier (old fault) # 7.59000e-06 # 9.42100e-06
TEST PASSED
--------------------------------------------------------------------------
mpiexec noticed that process rank 3 with PID 16362 on node saturn exited on signal 9 (Killed).
--------------------------------------------------------------------------
From the output, we can see that:
- the job has been managed as a recoverable job (correct)
- the MPI layer has recovered and proceeded to continue execution without interference from PRTE (correct; see the error-handler sketch after this list)
- when the application finalizes/completes, an error is generated that sets the completion status of the job to an error code (not correct)
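For reference, here is a minimal sketch of the kind of ULFM error handler such a test exercises. This is not the actual err_handler source; it assumes Open MPI's ULFM extension (mpi-ext.h, MPIX_Comm_revoke/MPIX_Comm_shrink):

#include <mpi.h>
#include <mpi-ext.h>   /* MPIX_Comm_revoke, MPIX_Comm_shrink, MPIX_ERR_* */
#include <signal.h>

static MPI_Comm world;  /* communicator we repair in place */

static void errh(MPI_Comm *comm, int *err, ...)
{
    MPI_Comm shrunk;
    if (*err != MPIX_ERR_PROC_FAILED && *err != MPIX_ERR_REVOKED) {
        MPI_Abort(*comm, *err);       /* not a process failure: give up */
    }
    MPIX_Comm_revoke(world);          /* unblock ranks stuck in MPI calls */
    MPIX_Comm_shrink(world, &shrunk); /* rebuild without the failed ranks */
    world = shrunk;
}

int main(int argc, char *argv[])
{
    MPI_Errhandler eh;
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_dup(MPI_COMM_WORLD, &world);
    MPI_Comm_create_errhandler(errh, &eh);
    MPI_Comm_set_errhandler(world, eh);
    MPI_Comm_rank(world, &rank);

    if (3 == rank) raise(SIGKILL);    /* "committing suicide", as in the log */
    MPI_Barrier(world);               /* survivors recover via the handler */

    MPI_Finalize();
    return 0;
}

The point for this issue is that once the handler has repaired the communicator, the application completes normally, so mpiexec should report success.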
The incorrect error at completion arises because, at some point, we have set jdata->state to PRTE_PROC_ABORTED_BY_SIG.
Potential solution
One option is to not mark the job as ABORTED_BY_SOMETHING when the job is recoverable, and to mark only the individual procs. In that case, the client (e.g., MPI) should take action to abort the job if needed, and that action is what should set the jdata->state ABORTED_BY_SOMETHING flag for the job.
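A hypothetical sketch of that decision logic (all type and function names below are illustrative stand-ins, not actual PRTE code):

#include <stdbool.h>

typedef enum {
    JOB_STATE_RUNNING,
    JOB_STATE_ABORTED_BY_SIG,   /* stand-in for PRTE's ABORTED_BY_SIG state */
} state_t;

typedef struct { state_t state; bool recoverable; } job_t;
typedef struct { state_t state; } proc_t;

void on_proc_aborted_by_signal(job_t *jdata, proc_t *proc)
{
    /* always record the failure on the individual proc */
    proc->state = JOB_STATE_ABORTED_BY_SIG;

    if (jdata->recoverable) {
        /* recoverable job: leave jdata->state alone; the client (MPI)
         * decides whether to continue or to abort explicitly, and only
         * that explicit abort should flip the job to an aborted state */
        return;
    }

    /* non-recoverable job: escalate to a job-level abort, as today */
    jdata->state = JOB_STATE_ABORTED_BY_SIG;
}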
In the error case, there may be a technical difficulty in attributing the ABORTED_BY_SOMETHING to the appropriate SOMETHING: if a proc gets ABORTED_BY_SIGNAL, the job is recoverable, and the MPI layer then decides that it cannot recover, we may mis-identify the condition as the application having called PMIx_Abort().
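On the client side, the action would look something like the following sketch, which assumes only the standard PMIx client API (PMIx_Init, PMIx_Abort, PMIx_Finalize); the recovery decision itself is a placeholder:

#include <pmix.h>

int main(void)
{
    pmix_proc_t me;
    if (PMIX_SUCCESS != PMIx_Init(&me, NULL, 0)) {
        return 1;
    }

    int cannot_recover = 1;  /* placeholder for the MPI layer's real decision */
    if (cannot_recover) {
        /* NULL/0 targets every proc in the job; under the proposed scheme,
         * this explicit call (not the earlier signal) is what should mark
         * the job as aborted */
        PMIx_Abort(1, "cannot recover from process failure", NULL, 0);
    }

    PMIx_Finalize(NULL, 0);
    return 0;
}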
What version of the PMIx Reference Server are you using? (e.g., v1.0, v2.1, git master @ hash, etc.)
00a065fb02e9ed25190ae0a5a0c476722fc38984 (HEAD -> fix/bmg, mine/fix/bmg)
...
9d8188f0 (origin/master, origin/HEAD, master) Merge pull request #956 from rhc54/topic/fnc Ralph Castain 7 days ago
What version of PMIx are you using? (e.g., v1.2.5, v2.0.3, v2.1.0, git branch name and hash, etc.)
1191ca59 (HEAD) Merge pull request #2189 from rhc54/topic/dedup Ralph Castain 8 days ago
Please describe the system on which you are running
Operating system/version: CentOS/7
Computer hardware: x86_64
Network type: TCP
Details of the problem
To replicate the issue, one can use Open MPI and ompi-tests-public:
# build Open MPI (git submodule update, autogen, etc.)
cd ompi-builddir
${ompi_srcdir}/configure && make install

# fetch the ULFM tests
git clone https://github.com/open-mpi/ompi-tests-public
cd ompi-tests-public
git submodule update --recursive
cd ulfm-testing/api

# run the single test that exhibits the issue
salloc -N 2 ompi/master.debug//bin/mpiexec -N 2 -np 4 --with-ft mpi ./err_returns

### To run ALL FT tests:
#ULFM_PREFIX=${ompi_builddir} runtest.sh