Thanks for your report.
The first error is normal: you have set 'mpiexec/prte' in 'resilience' mode, but not the Open MPI library itself, hence all errors are fatal by default at the MPI level.
You can use mpiexec --with-ft ulfm to enable the full set of features required to run in ULFM fault tolerance mode at both the PRTE and MPI levels.
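For context, the MPI-level half is just the usual ULFM pattern: attach a non-fatal error handler and check return codes, as in the linked tutorial examples. A minimal sketch of that pattern (not the exact tutorial code; it assumes a ULFM-enabled build launched with mpiexec --with-ft ulfm):

#include <mpi.h>
#include <mpi-ext.h>   /* Open MPI extensions, e.g. MPIX_ERR_PROC_FAILED */
#include <stdio.h>

int main(int argc, char *argv[]) {
    MPI_Init(&argc, &argv);
    /* Report errors to the caller instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    int rank, buf = 42;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int rc = MPI_Bcast(&buf, 1, MPI_INT, 0, MPI_COMM_WORLD);
    if (MPI_SUCCESS != rc) {
        int eclass;
        MPI_Error_class(rc, &eclass);
        if (MPIX_ERR_PROC_FAILED == eclass)
            fprintf(stderr, "%04d: bcast reported a peer failure\n", rank);
    }

    MPI_Finalize();
    return 0;
}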
I will push a patch for the second error shortly. I'll update this issue then.
Thanks for the reply. I tested with the modified code from your PR and it works.
But I ran into another weird result with https://github.com/ICLDisco/ulfm-testing/blob/master/api/buddycr.c
> mpicc buddycr.c -o buddycr
> mpiexec -v --with-ft ulfm -n 2 --host 10.0.7.1,10.0.7.2 ./buddycr -v
[Haizs:2078808] Spawning job
[Haizs:2078808] JOB prterun-Haizs-2078808@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0000 after iteration 0
Rank 0000: starting bcast 1
Rank 0001: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=2, size(scomm)=1
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
> mpiexec -v --with-ft ulfm -n 3 --host 10.0.7.1,10.0.7.2,10.0.7.3 ./buddycr -v
[Haizs:2059816] Spawning job
[Haizs:2059816] JOB prterun-Haizs-2059816@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0002 after iteration 0
Rank 0002: checkpointing to 0000 after iteration 0
Rank 0002: starting bcast 1
Rank 0001: starting bcast 1
Rank 0000: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0002: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0002 after iteration 2
Rank 0002: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0002: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
Rank 0002: starting bcast 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0002: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
0002: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
--------------------------------------------------------------------------
All nodes which are allocated for this job are already filled.
--------------------------------------------------------------------------
0000: after comm_spawn, flag=0
0000: comm_spawn failed with an unexpected error: local return with MPI_ERR_UNKNOWN: unknown error (14)
0002: after comm_spawn, flag=0
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
Proc: [[59744,1],0]
Errorcode: 14
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
0002: comm_spawn failed with an unexpected error: local return with MPI_ERR_UNKNOWN: unknown error (14)
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 2 in communicator MPI_COMM_WORLD
Proc: [[59744,1],2]
Errorcode: 14
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
I added two output statements just before and after the MPI_Comm_spawn call at line 251, and it seems raise(SIGKILL) does not release the failed MPI process's slot. Are there any options I missed?
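For clarity, the two prints I added around the spawn look roughly like the following helper; the function and variable names are approximations of buddycr.c, not the exact code:

#include <mpi.h>
#include <stdio.h>

/* Approximate illustration of the two debug prints added around the
 * MPI_Comm_spawn call; comm is the original communicator, scomm the
 * shrunken one used as the spawn parent. */
static int spawn_with_trace(MPI_Comm comm, MPI_Comm scomm, char *cmd,
                            MPI_Comm *icomm) {
    int rank, np, ns, rc, flag;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &np);
    MPI_Comm_size(scomm, &ns);
    printf("%04d: before comm_spawn 1 process, size(comm)=%d, size(scomm)=%d\n",
           rank, np, ns);
    rc = MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1, MPI_INFO_NULL,
                        0, scomm, icomm, MPI_ERRCODES_IGNORE);
    flag = (MPI_SUCCESS == rc);   /* the "flag" shown in the output above */
    printf("%04d: after comm_spawn, flag=%d\n", rank, flag);
    return rc;
}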
For more information: explicitly allowing oversubscription also does not finish normally (the run either exits unexpectedly early or gets stuck in finalize). I also tested with the script in the 'api' folder from the link above; all tests succeed when running on only one host.
> mpiexec -v --with-ft ulfm -n 2 --host 10.0.7.1,10.0.7.2 --map-by :OVERSUBSCRIBE ./buddycr -v
[Haizs:2077262] Spawning job
[Haizs:2077262] JOB prterun-Haizs-2077262@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0000 after iteration 0
Rank 0001: starting bcast 1
Rank 0000: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=2, size(scomm)=1
{{exit here}}
> mpiexec -v --with-ft ulfm -n 3 --host 10.0.7.1,10.0.7.2,10.0.7.3 --map-by :OVERSUBSCRIBE ./buddycr -v
[Haizs:2061281] Spawning job
[Haizs:2061281] JOB prterun-Haizs-2061281@1 EXECUTING
Rank 0000: checkpointing to 0001 after iteration 0
Rank 0001: checkpointing to 0002 after iteration 0
Rank 0002: checkpointing to 0000 after iteration 0
Rank 0000: starting bcast 1
Rank 0001: starting bcast 1
Rank 0002: starting bcast 1
Rank 0000: starting bcast 2
Rank 0001: starting bcast 2
Rank 0002: starting bcast 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0001: checkpointing to 0002 after iteration 2
Rank 0002: checkpointing to 0000 after iteration 2
Rank 0000: starting bcast 3
Rank 0001: starting bcast 3
Rank 0002: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: committing suicide at iteration 4
Rank 0002: starting bcast 4
0000: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0002: errhandler invoked with error MPI_ERR_PROC_FAILED: Process Failure
0000: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
0002: before comm_spawn 1 process, size(comm)=3, size(scomm)=2
0000: after comm_spawn, flag=1
0002: after comm_spawn, flag=1
Spawnee 0: crank=1
Rank 0001: restarting from 0002 at iteration 2
Rank 0002: sending checkpoint to 0001 at iteration 2
Rank 0000: checkpointing to 0001 after iteration 2
Rank 0002: checkpointing to 0000 after iteration 2
Rank 0001: checkpointing to 0002 after iteration 2
Rank 0000: starting bcast 3
Rank 0002: starting bcast 3
Rank 0001: starting bcast 3
Rank 0000: starting bcast 4
Rank 0001: starting bcast 4
Rank 0002: starting bcast 4
Rank 0001: checkpointing to 0002 after iteration 4
Rank 0000: checkpointing to 0001 after iteration 4
Rank 0002: checkpointing to 0000 after iteration 4
Rank 0000: starting bcast 5
Rank 0001: starting bcast 5
Rank 0002: starting bcast 5
Rank 0001: test completed!
Rank 0000: test completed!
Rank 0002: test completed!
{{stuck here}}
Thanks again. I can replicate both issues; I believe both stem from the same underlying condition. The tentative scenario is that the slot held by the failed process is not released back to the runtime, which matches your observation.
While I work on fixing these, you can work around it by using the INFO parameter of MPI_COMM_SPAWN to pass in a specific host on which you want to launch (use the host= or hostfile= info keys; see the man page for MPI_COMM_SPAWN).
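A minimal sketch of that workaround, assuming scomm is the shrunken communicator buddycr.c uses as the spawn parent and 10.0.7.2 is just an example host with a free slot:

#include <mpi.h>

/* Sketch: pass a "host" (or "hostfile") info key so the replacement process
 * is placed on a node that still has a free slot. The host value is a
 * placeholder; adapt it to your allocation. */
static int spawn_on_host(MPI_Comm scomm, char *cmd, const char *host,
                         MPI_Comm *icomm) {
    MPI_Info info;
    int rc;
    MPI_Info_create(&info);
    MPI_Info_set(info, "host", host);   /* e.g. "10.0.7.2" */
    rc = MPI_Comm_spawn(cmd, MPI_ARGV_NULL, 1, info,
                        0 /* root */, scomm, icomm, MPI_ERRCODES_IGNORE);
    MPI_Info_free(&info);
    return rc;
}

With hostfile= you can instead point at a file listing the nodes (and slots) reserved for replacements.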
@abouteiller @Haizs is this resolved? Can we close? Maybe we should open a new issue to track the lingering bug?
Closing as the PRs are merged.
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.x; the same error also occurs on the latest master (52d1096).
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
From a git clone, configured with --with-ft=mpi.
If you are building/installing from a git clone, please copy-n-paste the output from
git submodule status
Please describe the system on which you are running
Details of the problem
Sample code from https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/sc20/01.err_returns.c