open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Errors occurred while testing ULFM #9434

Closed (Haizs closed this issue 2 years ago)

Haizs commented 3 years ago

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.x; the same error also occurs on the latest master (52d1096)

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

From a git clone, configured with --with-ft=mpi
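
For reference, the build steps were along these lines (a sketch; everything other than --with-ft=mpi, such as the install prefix, is illustrative):

> ./autogen.pl
> ./configure --with-ft=mpi --prefix=$HOME/opt/ompi
> make -j install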

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

409d89fd2c6ad47e6cab4d4f18e72dd3e0af2e70 3rd-party/openpmix (v1.1.3-3138-g409d89fd)
8622c287399f3a22102b998aee098d97a621bdbe 3rd-party/prrte (psrvr-v2.0.0rc1-4018-g8622c28739)

Please describe the system on which you are running


Details of the problem

Sample code from https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/sc20/01.err_returns.c

> mpicc 01.err_returns.c -o err_returns

> mpiexec -v --enable-recovery -n 2 --host 10.0.7.1,10.0.7.2 ./err_returns
[Haizs:666849] Spawning job
[Haizs:666849] JOB prterun-Haizs-666849@1 EXECUTING
[Haizs:00000] *** An error occurred in Socket closed
[Haizs:00000] *** reported by process [989659137,0]
[Haizs:00000] *** on a NULL communicator
[Haizs:00000] *** Unknown error
[Haizs:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Haizs:00000] ***    and MPI will try to terminate your MPI job as well)

> mpiexec -v --enable-recovery -n 2 --host 10.0.7.1,10.0.7.2 --omca mpi_ft_enable true --omca mpi_ft_verbose 1 ./err_returns
[Haizs:667029] Spawning job
[Haizs:667029] JOB prterun-Haizs-667029@1 EXECUTING
[Haizs:667034] [[62192,1],0] ompi: Process [[62192,1],1] failed (state = -57).
[Haizs:667034] [[62192,1],0] ompi: Error event reported through PMIx from [[62192,1],0] (state = -57). This error type is not handled by the fault tolerant layer and the application will now presumably abort.
[Haizs:00000] *** An error occurred in PMIx Event Notification
[Haizs:00000] *** reported by process [4075814913,0]
[Haizs:00000] *** on a NULL communicator
[Haizs:00000] *** Unknown error (this should not happen!)
[Haizs:00000] *** MPI_ERRORS_ARE_FATAL (processes in this communicator will now abort,
[Haizs:00000] ***    and MPI will try to terminate your MPI job as well)
abouteiller commented 3 years ago

Thanks for your report,

The first error is expected: you have put 'mpiexec/prte' in 'resilience' mode, but not the Open MPI library itself, hence all errors are fatal by default at the MPI level.

You can use mpiexec --with-ft ulfm to enable the full set of features required to run in ULFM fault-tolerance mode at both the PRTE level and the MPI level.
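
For example, adapting the command from your report (illustrative only, reusing your host list):

> mpiexec --with-ft ulfm -n 2 --host 10.0.7.1,10.0.7.2 ./err_returns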

I will push a patch for the second error shortly. I'll update this issue then.

Haizs commented 3 years ago

Thanks for the reply. I tested with the modified code from your PR and it works.

But I ran into another weird result with https://github.com/ICLDisco/ulfm-testing/blob/master/api/buddycr.c

> mpicc buddycr.c -o buddycr

> mpiexec -v --with-ft ulfm -n 2 --host 10.0.7.1,10.0.7.2 ./buddycr -v
[Haizs:2078808] Spawning job
[Haizs:2078808] JOB prterun-Haizs-2078808@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0000 after iteration 0
Rank 0000: starting bcast 1
Rank 0001: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=2, size(scomm)=1
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------

> mpiexec -v --with-ft ulfm -n 3 --host 10.0.7.1,10.0.7.2,10.0.7.3 ./buddycr -v
[Haizs:2059816] Spawning job
[Haizs:2059816] JOB prterun-Haizs-2059816@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0002 after iteration 0
Rank 0002: checkpointing to 0000 after iteration 0
Rank 0002: starting bcast 1
Rank 0001: starting bcast 1
Rank 0000: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0002: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0002 after iteration 2
Rank 0002: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0002: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
Rank 0002: starting bcast 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0002: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
0002: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
0000: after comm_spawn, flag=0
0000: comm_spawn failed with an unexpected error: local return with MPI_ERR_UNKNOWN: unknown error (14)
0002: after comm_spawn, flag=0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
  Proc: [[59744,1],0]
  Errorcode: 14

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
0002: comm_spawn failed with an unexpected error: local return with MPI_ERR_UNKNOWN: unknown error (14)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
  Proc: [[59744,1],2]
  Errorcode: 14

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------

I added two output statements just before and after the MPI_Comm_spawn at line 251, and it seems that raise(SIGKILL) does not release the failed MPI process's slot. Are there any options I missed?
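
For context, the two added prints look roughly like this (a sketch only; the variable names rank, comm, scomm, icomm are guesses for illustration and not the actual buddycr.c code):

/* Hedged sketch of the two debug prints wrapped around the spawn call. */
int csize, ssize, flag;
MPI_Comm_size(comm, &csize);   /* original (pre-failure) communicator size */
MPI_Comm_size(scomm, &ssize);  /* size of the shrunken communicator */
printf("%04d: before comm_spawn %d process, size(comm)=%d, size(scomm)=%d\n",
       rank, 1, csize, ssize);
flag = (MPI_SUCCESS == MPI_Comm_spawn("./buddycr", MPI_ARGV_NULL, 1,
                                      MPI_INFO_NULL, 0, scomm, &icomm,
                                      MPI_ERRCODES_IGNORE));
printf("%04d: after comm_spawn, flag=%d\n", rank, flag);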

For more information: explicitly allowing oversubscription also does not finish normally (it either exits unexpectedly early or hangs in finalize). I also tested with the script in the 'api' folder from the link above; all tests succeed when running on a single host only.

> mpiexec -v --with-ft ulfm -n 2 --host 10.0.7.1,10.0.7.2 --map-by :OVERSUBSCRIBE ./buddycr -v
[Haizs:2077262] Spawning job
[Haizs:2077262] JOB prterun-Haizs-2077262@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0000 after iteration 0
Rank 0001: starting bcast 1
Rank 0000: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=2, size(scomm)=1
{{exit here}}

> mpiexec -v --with-ft ulfm -n 3 --host 10.0.7.1,10.0.7.2,10.0.7.3 --map-by :OVERSUBSCRIBE ./buddycr -v
[Haizs:2061281] Spawning job
[Haizs:2061281] JOB prterun-Haizs-2061281@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0002 after iteration 0
Rank 0002: checkpointing to 0000 after iteration 0
Rank 0000: starting bcast 1
Rank 0001: starting bcast 1
Rank 0002: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0002: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0002 after iteration 2
Rank 0002: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0002: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
Rank 0002: starting bcast 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0002: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
0002: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
0000: after comm_spawn, flag=1
0002: after comm_spawn, flag=1
Spawnee 0: crank=1
Rank 0001: restarting from 0002 at iteration 2
Rank 0002: sending checkpoint to 0001 at iteration 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0002: checkpointing to 0000 after iteration 2
Rank 0001: checkpointing to 0002 after iteration 2
Rank 0000: starting bcast 3
Rank 0002: starting bcast 3
Rank 0001: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: starting bcast 4
Rank 0002: starting bcast 4
Rank 0001: checkpointing to 0002 after iteration 4
Rank 0000: checkpointing to 0001 after iteration 4
Rank 0002: checkpointing to 0000 after iteration 4
Rank 0000: starting bcast 5
Rank 0001: starting bcast 5
Rank 0002: starting bcast 5
Rank 0001: test completed!
Rank 0000: test completed!
Rank 0002: test completed!
{{stuck here}}
abouteiller commented 3 years ago

Thanks again. I can replicate both issues; I believe both stem from the same underlying condition. The tentative scenario is:

  1. PRTE no longer marks the failed process as 'terminated' in its own tracking,
  2. thus the failed process still occupies its slot (first bug),
  3. and during MPI_Finalize, the PMIx_Fence waits indefinitely on the failed process (second bug).

While I work on fixing these, you can work around the problem by using the INFO parameter of MPI_COMM_SPAWN to pass in a specific host on which to launch (use the host= or hostfile= info keys; see the man page for MPI_COMM_SPAWN).
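
Something along these lines (an untested sketch; the helper name and the host address 10.0.7.3 are illustrative, scomm is assumed to be the shrunken communicator):

#include <mpi.h>

/* Spawn one replacement process on an explicitly named host, so the
 * launch does not rely on PRTE having released the failed process's
 * slot. "host" (or "hostfile") is an info key recognized by Open MPI's
 * MPI_Comm_spawn. */
static int respawn_on_host(MPI_Comm scomm, const char *cmd, MPI_Comm *icomm)
{
    MPI_Info info;
    int rc;

    MPI_Info_create(&info);
    MPI_Info_set(info, "host", "10.0.7.3");  /* node with a known free slot */
    rc = MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1, info, 0 /* root */,
                        scomm, icomm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
    return rc;
}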

awlauria commented 2 years ago

@abouteiller @Haizs is this resolved? Can we close? Maybe we should open a new issue to track the lingering bug?

awlauria commented 2 years ago

Closing as the PRs have been merged.