Closed Robyroc closed 1 year ago
@bosilca @abouteiller Can you have a look?
Appears to just be a case of some outdated readme text - things moved in PRRTE and while the behavior was maintained, the mechanisms for doing it evolved. At first glance, mpi_ft_detector should not be deprecated and there is no errmgr_detector any longer. I'm pretty sure (would need to check) that you also don't need to configure --with-ft for PRRTE's sake, though it might still be required for the MPI layer (I honestly don't know).
No, there is more to it than just the readme. The user is doing the right things here (1). I am observing the same problem on very simple tests. This is new as of just a couple of weeks ago (it worked on rc5, I think). The root cause is that the event PMIX_ERR_PROC_ABORTED (and its relatives) is not produced and doesn't trigger the MPI error handler callback. The event does get produced in the 'examples/faults.c' test from prte. Looking into it.
(1): @Robyroc in your example you only have the guarantee that some ranks will return with MPI_SUCCESS and others will return with MPI_ERR_PROC_FAILED; you don't have the uniformity property that error returns are consistent. See https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/06.err_comm_dup.c
Yes I know, I worded my assumption badly. I would need an MPIX_Comm_agree to get the same result on all the nodes. Nonetheless, I could assume that all the processes return from the MPI_Comm_dup function, either with a success or with a fault. The deadlock eventuality should not be present as far as I understood, right?
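To make the uniformity point concrete, here is a minimal plain-C model of what an agreement step adds on top of non-uniform return codes. It makes no MPI calls: model_agree is a hypothetical name, and the per-rank arrays stand in for what each process would hold locally. It assumes the agreement combines the contributed flags with a bitwise AND over the surviving ranks, which is my understanding of how MPIX_Comm_agree behaves.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Plain-C model (no MPI calls) of an agreement step.
 * flags[i] is the value rank i contributes, e.g. 1 if its local
 * MPI_Comm_dup returned MPI_SUCCESS and 0 otherwise.
 * alive[i] == false marks a failed rank, which contributes nothing.
 * The result is the bitwise AND over the surviving ranks, so every
 * survivor that receives it sees the same value.
 * model_agree is a hypothetical name, not part of any MPI library.
 */
int model_agree(const int *flags, const bool *alive, size_t nranks)
{
    assert(flags != NULL && alive != NULL);

    int agreed = ~0; /* identity element for bitwise AND */
    for (size_t i = 0; i < nranks; i++) {
        if (alive[i]) {
            agreed &= flags[i];
        }
    }
    return agreed;
}
```

With three surviving ranks contributing 1, 0, 1 (one of them saw the failure, the others did not), the model returns 0 for everyone, so all survivors can take the same recovery path even though their local return codes differed.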
I have a fix for this, coming soon
Thank you for taking the time to submit an issue!
Background information
What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)
v5.0.0rc8
Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)
from git clone
If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.
+250004266bc046c6303c8531ababdff4e1237525 3rd-party/openpmix (v1.1.3-3661-g2500042)
+ca2bf3aeab38261ae7c88cea64bc782c949bd76e 3rd-party/prrte (psrvr-v2.0.0rc1-4517-gca2bf3a)
Please describe the system on which you are running
Details of the problem
After building from the sources with
I tried to use OpenMPI with ULFM to test the deadlock removal in case of faults for communicator creation operations. In particular I use this code:
I compile the code without additional options, and run it with this command:
I'm expecting the execution to print, for each node,
I'm <rank>, starting
and
Rank <rank>: error MPI_ERR_PROC_FAILED
but I get only the first messages, which suggests that the execution deadlocks on the MPI_Comm_dup call. I've also tried using the experimental OpenMPI detector, using the command below:
All these attempts gave me no result. I managed to make it work using the deprecated option mpi_ft_detector true, but I think I should be able to obtain the same result without relying on deprecated features. Is there something that I'm missing or using wrongly?
Thank you!