open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Comm_dup deadlocks on OMPI v5.0.0rc8 without mpi_ft_detector #11153

Closed: Robyroc closed this issue 1 year ago

Robyroc commented 1 year ago

Thank you for taking the time to submit an issue!

Background information

What version of Open MPI are you using? (e.g., v3.0.5, v4.0.2, git branch name and hash, etc.)

v5.0.0rc8

Describe how Open MPI was installed (e.g., from a source/distribution tarball, from a git clone, from an operating system distribution package, etc.)

from git clone

If you are building/installing from a git clone, please copy-n-paste the output from git submodule status.

+250004266bc046c6303c8531ababdff4e1237525 3rd-party/openpmix (v1.1.3-3661-g2500042)
+ca2bf3aeab38261ae7c88cea64bc782c949bd76e 3rd-party/prrte (psrvr-v2.0.0rc1-4517-gca2bf3a)

Please describe the system on which you are running


Details of the problem

After building from the sources with

shell$ ./configure --with-ft=ulfm --with-tm=<path_to_pbs> --prefix=<installation_prefix>

I tried to use Open MPI with ULFM to verify that communicator creation operations no longer deadlock in the presence of faults. In particular, I use this code:

#include <stdio.h>
#include "mpi.h"
#include "mpi-ext.h"
#include <signal.h>

int main(int argc, char** argv)
{
    int rank, len;
    char errstr[MPI_MAX_ERROR_STRING];
    MPI_Comm comm;
    MPI_Init(&argc, &argv);
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    printf("I'm %d, starting\n", rank);
    if(rank == 0)
        raise(SIGINT);
    int rc = MPI_Comm_dup(MPI_COMM_WORLD, &comm);
    MPI_Error_string(rc, errstr, &len);
    printf("Rank %d: error: %s\n", rank, errstr);
    MPI_Finalize();
}

I compile the code without additional options, and run it with this command:

shell$ mpirun --with-ft ulfm --mca mpi_ft_verbose 1 ./test

I'm expecting each rank to print I'm <rank>, starting followed by Rank <rank>: error: MPI_ERR_PROC_FAILED, but I only get the first set of messages, which suggests that the execution deadlocks in the MPI_Comm_dup call. I've also tried the experimental Open MPI detector, using the command below:

shell$ mpirun --with-ft ulfm --mca mpi_ft_verbose 1 --prtemca errmgr_detector_priority 0 ./test

None of these attempts worked. I managed to make it work using the deprecated option mpi_ft_detector true, but I think I should be able to obtain the same result without relying on deprecated features.

Is there something that I'm missing/using wrongly?

Thank you!

jsquyres commented 1 year ago

@bosilca @abouteiller Can you have a look?

rhc54 commented 1 year ago

This appears to just be a case of some outdated README text: things moved in PRRTE, and while the behavior was maintained, the mechanisms for doing it evolved. At first glance, mpi_ft_detector should not be deprecated, and there is no errmgr_detector any longer. I'm pretty sure (I would need to check) that you also don't need to configure --with-ft for PRRTE's sake, though it might still be required for the MPI layer (I honestly don't know).

abouteiller commented 1 year ago

No, there is more to it than just the README. The user is doing the right things here (1). I am observing the same problem on very simple tests. This regression appeared just a couple of weeks ago (it worked on rc5, I think). The root cause is that the event PMIX_ERR_PROC_ABORTED (and its relatives) is not produced and does not trigger the MPI error handler callback. The event does get produced in the examples/faults.c test from PRTE. Looking into it.

(1): @Robyroc in your example you only have the guarantee that some ranks will return MPI_SUCCESS and others will return MPI_ERR_PROC_FAILED; you don't have the uniformity property that error returns are consistent across ranks. See https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/06.err_comm_dup.c

Robyroc commented 1 year ago

> (1): @Robyroc in your example you only have the guarantee that some ranks will return MPI_SUCCESS and others will return MPI_ERR_PROC_FAILED; you don't have the uniformity property that error returns are consistent across ranks. See https://github.com/ICLDisco/ulfm-testing/blob/master/tutorial/06.err_comm_dup.c

Yes, I know; I phrased my assumption badly. I would need an MPIX_Comm_agree to get the same result on all ranks. Nonetheless, I could assume that all processes eventually return from MPI_Comm_dup, either with success or with an error. As far as I understand, a deadlock should not be possible, right?
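For reference, the uniformity pattern from the linked ULFM tutorial can be sketched roughly as below. This is a sketch, not the exact tutorial code; the helper name uniform_comm_dup is made up here, and it assumes the ULFM extensions MPIX_Comm_agree and MPIX_ERR_PROC_FAILED from mpi-ext.h are available:

```c
#include "mpi.h"
#include "mpi-ext.h"  /* ULFM extensions: MPIX_Comm_agree, MPIX_ERR_PROC_FAILED */

/* Duplicate 'comm' into 'newcomm' with a uniform outcome: either every
 * surviving rank gets a valid duplicate, or every surviving rank agrees
 * that the duplication failed. Hypothetical helper, sketched after the
 * ULFM tutorial's 06.err_comm_dup.c pattern. */
static int uniform_comm_dup(MPI_Comm comm, MPI_Comm *newcomm)
{
    int rc = MPI_Comm_dup(comm, newcomm);

    /* flag is 1 only if the local dup succeeded; MPIX_Comm_agree performs
     * a fault-tolerant AND-reduction of flag over the surviving ranks. */
    int flag = (MPI_SUCCESS == rc);
    MPIX_Comm_agree(comm, &flag);

    if (!flag) {
        /* At least one rank failed the dup: ranks that succeeded release
         * their local duplicate so everyone ends up in the same state. */
        if (MPI_SUCCESS == rc)
            MPI_Comm_free(newcomm);
        return MPIX_ERR_PROC_FAILED;
    }
    return MPI_SUCCESS;
}
```

The agreement is the step that restores uniformity: without it, ranks that drew MPI_SUCCESS from the dup would proceed with the new communicator while others had already observed the failure.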

abouteiller commented 1 year ago

I have a fix for this coming soon.