ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

deadlock in recv/allreduce if participating process fails #50

Open abouteiller opened 5 years ago

abouteiller commented 5 years ago

Original report by Kai Keller (Bitbucket: kellekai, GitHub: kellekai).


A simple case uses 2 processes: one process fails and the other calls MPI_Recv to receive from it; the result is a deadlock. The same happens with MPI_Allreduce when some processes fail after all surviving processes have already called MPI_Allreduce.

Here is a simple example for the first case:

#include <mpi.h>
#include <mpi-ext.h>

int main() {

    MPI_Init(NULL, NULL);
    MPI_Comm_set_errhandler( MPI_COMM_WORLD, MPI_ERRORS_RETURN );

    int rank; MPI_Comm_rank( MPI_COMM_WORLD, &rank );
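    if( rank == 0 ) {
        /* deliberate NULL dereference: rank 0 crashes to simulate a process failure */
        *(int*)NULL = 0;
    }

    /* surviving ranks block here, waiting on a message from the dead rank 0 */
    int r;
    MPI_Recv( &r, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE );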

    return 0;

}

Another example, for the second case:

#include <mpi.h>
#include <mpi-ext.h>
#include <unistd.h>

int main() {

    MPI_Init(NULL, NULL);
    MPI_Comm_set_errhandler( MPI_COMM_WORLD, MPI_ERRORS_RETURN );

    int rank; MPI_Comm_rank( MPI_COMM_WORLD, &rank );
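    if( rank == 0 ) {
        /* give the other ranks time to enter MPI_Allreduce, then crash */
        sleep(1);
        *(int*)NULL = 0;
    }

    int r = rank, s; /* initialize the send buffer so the reduction reads a defined value */
    MPI_Allreduce( &r, &s, 1, MPI_INT, MPI_SUM, MPI_COMM_WORLD );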

    return 0;

}
abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Trying to replicate

abouteiller commented 5 years ago

Original comment by Nuria Losada (Bitbucket: nuriallv, GitHub: nuriallv).


Hi Kai,

Do you see the same issue when tuning mpi_ft_detector_period and mpi_ft_detector_timeout?
(E.g. --mca mpi_ft_detector_period 1e-1 --mca mpi_ft_detector_timeout 3e-1)
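
For readers unfamiliar with these two parameters, here is a rough sketch of how a heartbeat-based failure detector typically uses a period and a timeout. It assumes (this is not confirmed in this thread) that the period is the interval between heartbeats and the timeout is the length of silence after which a peer is reported as failed. The type hb_detector and the hb_* functions are hypothetical names for illustration only; this is not Open MPI's implementation.

```c
#include <stdbool.h>

/* Hypothetical model of a heartbeat failure detector (illustration only,
 * not Open MPI code). All times are in seconds. */
typedef struct {
    double period;    /* assumed: interval between two heartbeats */
    double timeout;   /* assumed: silence after which the peer is suspected */
    double last_seen; /* arrival time of the most recent heartbeat */
} hb_detector;

/* Sender side: is it time to emit the next heartbeat? */
bool hb_should_send(const hb_detector *d, double last_sent, double now)
{
    return (now - last_sent) >= d->period;
}

/* Observer side: record a heartbeat that arrived at time `now`. */
void hb_receive(hb_detector *d, double now)
{
    d->last_seen = now;
}

/* Observer side: the peer is suspected once it has been silent
 * for longer than the timeout. */
bool hb_suspect(const hb_detector *d, double now)
{
    return (now - d->last_seen) > d->timeout;
}
```

Under this model, the values suggested above (period 1e-1, timeout 3e-1) would mean a crashed peer is reported roughly 0.3 s after its last heartbeat, so a receive that stays blocked much longer than that would not be explained by detector latency alone.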

abouteiller commented 5 years ago

Original comment by Kai Keller (Bitbucket: kellekai, GitHub: kellekai).


Hi Nuria,

Yes, I do. I run with:

mpirun --oversubscribe --mca btl tcp,self --mca mpi_ft_detector_period 1e-1 --mca mpi_ft_detector_timeout 3e-1 -n 2

Can you reproduce it?

I noticed that the routine ompi_request_state_ok in req_ft.c returns true; maybe it shouldn't?

abouteiller commented 4 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Kai,

Thank you for your report. Please try again with a version later than 0e249ca1ae5cb27a3f3d907173b65db188380ce5; it resolves a number of deadlock scenarios and should address your problem.