pmodels / mpich

Official MPICH Repository
http://www.mpich.org

bug/jenkins: ft/multi_isendalive fails on multiple platforms #2203

Open mpichbot opened 8 years ago

mpichbot commented 8 years ago

Originally by huiweilu on 2014-11-18 08:48:56 -0600


multi_isendalive fails on the following platforms:

mpichbot commented 8 years ago

Originally by huiweilu on 2014-11-18 08:52:42 -0600


  1. If configured with --enable-nemesis-dbg-localoddeven, the test hangs most often when rank 1 exits while rank 3 is still in MPI_Init, and it is always rank 3 that fails. There appears to be a barrier between rank 1 and rank 3 (MPID_nem_barrier), which looks like a shared-memory barrier built on OPA atomics. My guess is that killing rank 1 also causes some memory issue in rank 3 that makes rank 3 exit, but the exact reason remains unknown.

To avoid failing in init (the standard says "Initialization does not have any new semantics related to fault tolerance"), a barrier is recommended before killing rank 1. With the barrier added, the error still occurs, but less frequently (about 1 in 100 runs).

  2. The error is not limited to issend. If issend is replaced with isend or send, the error still happens.
  3. Even with all issend calls removed, leaving only init and finalize, an error occurs in finalize. The log shows it is stuck closing the VC of a finalized process.
mpichbot commented 8 years ago

Originally by huiweilu on 2014-11-18 09:06:27 -0600


With a simplified version, the error can be reproduced on MacOS with configure --enable-g=all --enable-fast=O0 --enable-strict=all --enable-nemesis-dbg-localoddeven. The chance of error is about 1/100.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv)
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stdout, "rank %d finished MPI_Init\n", rank);

    /* Return errors to the caller instead of aborting on a failed peer. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Make sure every rank has left MPI_Init before rank 1 dies. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Simulate a process failure: rank 1 exits without calling MPI_Finalize. */
    if (rank == 1) {
        exit(EXIT_FAILURE);
    }

    fprintf(stdout, "rank %d start finalize\n", rank);
    MPI_Finalize();

    return 0;
}

LOG:
(lldb) bt
* thread #1: tid = 0x384a70, 0x00007fff87649162 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff87649162 libsystem_kernel.dylib`__psynch_mutexwait + 10
    frame #1: 0x00007fff8560981e libsystem_pthread.dylib`_pthread_mutex_lock + 480
    frame #2: 0x00007fff89fafb78 libsystem_c.dylib`vfprintf_l + 28
    frame #3: 0x00007fff89fa8620 libsystem_c.dylib`fprintf + 186
    frame #4: 0x0000000108fc3389 multi_isendalive`MPIU_DBG_Outevent(file=0x00000001091204ee, line=93, class=16384, kind=2, fmat=0x0000000109123b14) + 1625 at dbg_printf.c:484
    frame #5: 0x000000010907be4e multi_isendalive`sigusr1_handler(sig=30) + 142 at ch3_progress.c:93
    frame #6: 0x00007fff90cf9f1a libsystem_platform.dylib`_sigtramp + 26
    frame #7: 0x00007fff85609e21 libsystem_pthread.dylib`__mtx_droplock + 391
    frame #8: 0x00007fff85609be2 libsystem_pthread.dylib`pthread_mutex_unlock + 63
    frame #9: 0x00007fff89fafb9d libsystem_c.dylib`vfprintf_l + 65
    frame #10: 0x00007fff89fa8620 libsystem_c.dylib`fprintf + 186
    frame #11: 0x0000000108fc30bb multi_isendalive`MPIU_DBG_Outevent(file=0x000000010910f952, line=1082, class=8388608, kind=0, fmat=0x0000000109121790) + 907 at dbg_printf.c:466
    frame #12: 0x0000000108fb5986 multi_isendalive`MPIR_Err_create_code_valist(lastcode=0, fatal=0, fcname=0x0000000109118fa3, line=1192, error_class=101, generic_msg=0x0000000109118fc4, specific_msg=0x0000000000000000, Argp=0x00007fff56d62cc0) + 2822 at errutil.c:1082
    frame #13: 0x0000000108fb4e2e multi_isendalive`MPIR_Err_create_code(lastcode=0, fatal=0, fcname=0x0000000109118fa3, line=1192, error_class=101, generic_msg=0x0000000109118fc4, specific_msg=0x0000000000000000) + 702 at errutil.c:868
    frame #14: 0x0000000108ff8526 multi_isendalive`MPIDI_CH3U_Complete_posted_with_error(vc=0x00007fa45b804a70) + 182 at ch3u_recvq.c:1192
    frame #15: 0x0000000108fdd651 multi_isendalive`MPIDI_CH3U_Handle_connection(vc=0x00007fa45b804a70, event=MPIDI_VC_EVENT_TERMINATED) + 1921 at ch3u_handle_connection.c:130
    frame #16: 0x00000001090adae2 multi_isendalive`error_closed(vc=0x00007fa45b804a70, req_errno=0) + 162 at socksm.c:1962
    frame #17: 0x00000001090b4322 multi_isendalive`MPID_nem_tcp_cleanup_on_error(vc=0x00007fa45b804a70, req_errno=0) + 146 at socksm.c:1993
    frame #18: 0x00000001090b4f59 multi_isendalive`MPID_nem_tcp_recv_handler(sc=0x00007fa45b0012d0) + 825 at socksm.c:1555
    frame #19: 0x00000001090b2d95 multi_isendalive`state_commrdy_handler(plfd=0x00007fa45ac055c0, sc=0x00007fa45b0012d0) + 309 at socksm.c:1683
    frame #20: 0x00000001090b412c multi_isendalive`MPID_nem_tcp_connpoll(in_blocking_poll=1) + 1420 at socksm.c:1845
    frame #21: 0x000000010908fe0e multi_isendalive`MPID_nem_network_poll(in_blocking_progress=1) + 30 at mpid_nem_network_poll.c:16
    frame #22: 0x0000000109078849 multi_isendalive`MPID_nem_mpich_blocking_recv(cell=0x00007fff56d63e58, in_fbox=0x00007fff56d63e54, completions=4) + 633 at mpid_nem_inline.h:906
    frame #23: 0x000000010907766d multi_isendalive`MPIDI_CH3I_Progress(progress_state=0x00007fff56d640c8, is_blocking=1) + 733 at ch3_progress.c:359
    frame #24: 0x0000000108fdf407 multi_isendalive`MPIDI_CH3U_VC_WaitForClose + 247 at ch3u_handle_connection.c:383
    frame #25: 0x0000000109045ea1 multi_isendalive`MPID_Finalize + 209 at mpid_finalize.c:106
    frame #26: 0x0000000108ec760d multi_isendalive`MPI_Finalize + 2685 at finalize.c:237
    frame #27: 0x0000000108e9c332 multi_isendalive`main(argc=1, argv=0x00007fff56d64708) + 194 at multi_isendalive.c:59
    frame #28: 0x00007fff948305c9 libdyld.dylib`start + 1
    frame #29: 0x00007fff948305c9 libdyld.dylib`start + 1

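In this simplified case the error has been located: in rank 0, fprintf calls pthread_mutex_unlock, which triggers SIGUSR1. The signal is caught by the ULFM sigusr1_handler, which calls fprintf again (through MPIU_DBG_Outevent) while the first fprintf is still in progress; the nested call then blocks in pthread_mutex_lock (frames #0 to #3 of the backtrace).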

The simplified case fails because of conflicting use of SIGUSR1 by ULFM and the C library. Specifically, on MacOS (and possibly on freebsd32 and solaris, too), SIGUSR1 is reserved by the pthread library to deal with mutex conditions.
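
Independent of where the signal comes from, the backtrace shows a signal handler calling fprintf while the interrupted code is itself inside fprintf, and fprintf is not async-signal-safe. A common way to avoid this kind of re-entrancy is to have the handler only set a flag and to do the logging later from normal context. The following is a minimal sketch of that pattern, not the fix that went into MPICH; the names failure_notified, deferred_sigusr1_handler, install_deferred_handler and poll_failure_notification are hypothetical and exist only for this illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flag: the handler only records that SIGUSR1 arrived. */
static volatile sig_atomic_t failure_notified = 0;

/* Async-signal-safe handler: no stdio and no locks, just a flag store. */
static void deferred_sigusr1_handler(int sig)
{
    (void)sig;
    failure_notified = 1;
}

/* Install the handler for SIGUSR1. Returns 0 on success, -1 on error. */
static int install_deferred_handler(void)
{
    struct sigaction sa;

    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = deferred_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = SA_RESTART;
    return sigaction(SIGUSR1, &sa, NULL);
}

/* Call from normal (non-signal) context, for example once per iteration
 * of a progress loop. Logs and returns 1 if a notification was pending,
 * otherwise returns 0. Signals that arrive between two polls are
 * coalesced into a single notification. */
static int poll_failure_notification(FILE *log)
{
    if (!failure_notified)
        return 0;
    failure_notified = 0;
    fprintf(log, "SIGUSR1 received; checking for failed processes\n");
    return 1;
}
```

With this structure the fprintf call never runs inside the handler, so it cannot nest inside another fprintf on the same stream that the signal happened to interrupt.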

On MacOS the failure only happens with '--enable-g=all'. On Linux I ran multi_isendalive with '--enable-g=all' 3000 times without a failure. Also, without --enable-g=all on MacOS, the original multi_isendalive (with the barrier, not the simplified version) ran 1000 times without a failure.
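
Failure rates like the ones quoted above (about 1/100, or 0 in 3000 runs) can be measured with a small repeat harness that also treats a hang as a failure. The sketch below is one possible way to do that and is not the script used for the numbers in this ticket; run_once, count_bad_runs and the timeout handling are assumptions made for the illustration.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdio.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

/* Run argv once. Returns 0 if it exited with status 0 before the timeout,
 * 1 if it exited non-zero or died from a signal, 2 if it hung and had to
 * be killed, and -1 if fork or waitpid failed. */
static int run_once(char *const argv[], int timeout_s)
{
    struct timespec tick = { 0, 10 * 1000 * 1000 }; /* poll every 10 ms */
    time_t deadline;
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        execvp(argv[0], argv);
        _exit(127); /* exec failed */
    }

    deadline = time(NULL) + timeout_s;
    for (;;) {
        pid_t r = waitpid(pid, &status, WNOHANG);
        if (r == pid)
            return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : 1;
        if (r < 0)
            return -1;
        if (time(NULL) > deadline) {
            /* Treat the run as hung: kill the launcher and reap it. */
            kill(pid, SIGKILL);
            waitpid(pid, &status, 0);
            return 2;
        }
        nanosleep(&tick, NULL);
    }
}

/* Run the command `runs` times and count every run that did not succeed. */
static int count_bad_runs(char *const argv[], int runs, int timeout_s)
{
    int bad = 0;

    for (int i = 0; i < runs; i++)
        if (run_once(argv, timeout_s) != 0)
            bad++;
    return bad;
}
```

A caller would pass an argument vector such as { "mpiexec", "-n", "4", "./multi_isendalive", NULL }; the process count of 4 is an assumption here, chosen only because the reports above mention rank 3. Note that killing the launcher with SIGKILL can leave stuck ranks behind, so these may need to be cleaned up separately between runs.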