Open mpichbot opened 8 years ago
Originally by huiweilu on 2014-11-18 08:52:42 -0600
With --enable-nemesis-dbg-localoddeven, the test most frequently hangs when rank 1 exits while rank 3 is still in MPI_Init, and it is always rank 3 that fails. There appears to be a barrier between rank 1 and rank 3 (MPID_nem_barrier), which looks like a shared-memory barrier based on OPA atomics. My guess is that killing rank 1 also corrupts shared-memory state that rank 3 depends on, so rank 3 exits as well, but the exact cause remains unknown. To avoid failing during init (the standard says "Initialization does not have any new semantics related to fault tolerance"), a barrier is recommended before killing rank 1. After adding the barrier, the error still occurs, but much less frequently (about 1 in 100 runs).
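For reference, a shared-memory barrier of this kind usually amounts to an arrival counter plus a sense flag in a segment mapped by all local ranks. The sketch below is a minimal sense-reversing barrier using C11 atomics with invented names (shm_barrier_t, num_local); it is not the actual MPID_nem_barrier code, only an illustration of why a rank that dies before arriving leaves the survivors spinning forever:

#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical barrier state, assumed to live in a shared-memory
 * segment mapped by every local rank (illustrative layout only). */
typedef struct {
    atomic_int  arrived;   /* ranks that have reached the barrier        */
    atomic_bool sense;     /* flipped by the last arriver to release all */
} shm_barrier_t;

static void shm_barrier(shm_barrier_t *b, int num_local)
{
    bool my_sense = !atomic_load(&b->sense);

    if (atomic_fetch_add(&b->arrived, 1) == num_local - 1) {
        /* Last arriver: reset the counter and release the waiters. */
        atomic_store(&b->arrived, 0);
        atomic_store(&b->sense, my_sense);
    } else {
        /* If another local rank is killed before it increments
         * 'arrived', this loop never terminates -- the hang seen
         * when rank 1 dies while rank 3 is still in MPI_Init. */
        while (atomic_load(&b->sense) != my_sense)
            ; /* spin */
    }
}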
Originally by huiweilu on 2014-11-18 09:06:27 -0600
With a simplified version, the error can be reproduced on MacOS configured with --enable-g=all --enable-fast=O0 --enable-strict=all --enable-nemesis-dbg-localoddeven. The chance of error is about 1 in 100 runs.
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

int main(int argc, char **argv)
{
    int rank, size, err;
    char buf[10];
    MPI_Request request;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    fprintf(stdout, "rank %d finished MPI_Init\n", rank);

    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* Barrier keeps the simulated failure out of MPI_Init. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Rank 1 simulates a process failure. */
    if (rank == 1) {
        exit(EXIT_FAILURE);
    }

    fprintf(stdout, "rank %d start finalize\n", rank);
    MPI_Finalize();
    return 0;
}
LOG:
(lldb) bt
* thread #1: tid = 0x384a70, 0x00007fff87649162 libsystem_kernel.dylib`__psynch_mutexwait + 10, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff87649162 libsystem_kernel.dylib`__psynch_mutexwait + 10
frame #1: 0x00007fff8560981e libsystem_pthread.dylib`_pthread_mutex_lock + 480
frame #2: 0x00007fff89fafb78 libsystem_c.dylib`vfprintf_l + 28
frame #3: 0x00007fff89fa8620 libsystem_c.dylib`fprintf + 186
frame #4: 0x0000000108fc3389 multi_isendalive`MPIU_DBG_Outevent(file=0x00000001091204ee, line=93, class=16384, kind=2, fmat=0x0000000109123b14) + 1625 at dbg_printf.c:484
frame #5: 0x000000010907be4e multi_isendalive`sigusr1_handler(sig=30) + 142 at ch3_progress.c:93
frame #6: 0x00007fff90cf9f1a libsystem_platform.dylib`_sigtramp + 26
frame #7: 0x00007fff85609e21 libsystem_pthread.dylib`__mtx_droplock + 391
frame #8: 0x00007fff85609be2 libsystem_pthread.dylib`pthread_mutex_unlock + 63
frame #9: 0x00007fff89fafb9d libsystem_c.dylib`vfprintf_l + 65
frame #10: 0x00007fff89fa8620 libsystem_c.dylib`fprintf + 186
frame #11: 0x0000000108fc30bb multi_isendalive`MPIU_DBG_Outevent(file=0x000000010910f952, line=1082, class=8388608, kind=0, fmat=0x0000000109121790) + 907 at dbg_printf.c:466
frame #12: 0x0000000108fb5986 multi_isendalive`MPIR_Err_create_code_valist(lastcode=0, fatal=0, fcname=0x0000000109118fa3, line=1192, error_class=101, generic_msg=0x0000000109118fc4, specific_msg=0x0000000000000000, Argp=0x00007fff56d62cc0) + 2822 at errutil.c:1082
frame #13: 0x0000000108fb4e2e multi_isendalive`MPIR_Err_create_code(lastcode=0, fatal=0, fcname=0x0000000109118fa3, line=1192, error_class=101, generic_msg=0x0000000109118fc4, specific_msg=0x0000000000000000) + 702 at errutil.c:868
frame #14: 0x0000000108ff8526 multi_isendalive`MPIDI_CH3U_Complete_posted_with_error(vc=0x00007fa45b804a70) + 182 at ch3u_recvq.c:1192
frame #15: 0x0000000108fdd651 multi_isendalive`MPIDI_CH3U_Handle_connection(vc=0x00007fa45b804a70, event=MPIDI_VC_EVENT_TERMINATED) + 1921 at ch3u_handle_connection.c:130
frame #16: 0x00000001090adae2 multi_isendalive`error_closed(vc=0x00007fa45b804a70, req_errno=0) + 162 at socksm.c:1962
frame #17: 0x00000001090b4322 multi_isendalive`MPID_nem_tcp_cleanup_on_error(vc=0x00007fa45b804a70, req_errno=0) + 146 at socksm.c:1993
frame #18: 0x00000001090b4f59 multi_isendalive`MPID_nem_tcp_recv_handler(sc=0x00007fa45b0012d0) + 825 at socksm.c:1555
frame #19: 0x00000001090b2d95 multi_isendalive`state_commrdy_handler(plfd=0x00007fa45ac055c0, sc=0x00007fa45b0012d0) + 309 at socksm.c:1683
frame #20: 0x00000001090b412c multi_isendalive`MPID_nem_tcp_connpoll(in_blocking_poll=1) + 1420 at socksm.c:1845
frame #21: 0x000000010908fe0e multi_isendalive`MPID_nem_network_poll(in_blocking_progress=1) + 30 at mpid_nem_network_poll.c:16
frame #22: 0x0000000109078849 multi_isendalive`MPID_nem_mpich_blocking_recv(cell=0x00007fff56d63e58, in_fbox=0x00007fff56d63e54, completions=4) + 633 at mpid_nem_inline.h:906
frame #23: 0x000000010907766d multi_isendalive`MPIDI_CH3I_Progress(progress_state=0x00007fff56d640c8, is_blocking=1) + 733 at ch3_progress.c:359
frame #24: 0x0000000108fdf407 multi_isendalive`MPIDI_CH3U_VC_WaitForClose + 247 at ch3u_handle_connection.c:383
frame #25: 0x0000000109045ea1 multi_isendalive`MPID_Finalize + 209 at mpid_finalize.c:106
frame #26: 0x0000000108ec760d multi_isendalive`MPI_Finalize + 2685 at finalize.c:237
frame #27: 0x0000000108e9c332 multi_isendalive`main(argc=1, argv=0x00007fff56d64708) + 194 at multi_isendalive.c:59
frame #28: 0x00007fff948305c9 libdyld.dylib`start + 1
frame #29: 0x00007fff948305c9 libdyld.dylib`start + 1
The error in this simplified case is located as follows: in rank 0, fprintf calls pthread_mutex_unlock, which triggers SIGUSR1. The signal is caught by the ULFM sigusr1_handler, which calls fprintf again from inside the handler; the re-entrant call then blocks on the stream lock held by the interrupted fprintf, and the process hangs (the backtrace above shows both nested fprintf frames).
The cause of the failure in this simplified case is the conflicting use of SIGUSR1 by ULFM and the C library. Specifically, on MacOS (and possibly freebsd32 and solaris, too), SIGUSR1 is reserved by the pthread library to deal with mutex conditions.
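As a general pattern (a minimal sketch, not the actual ULFM handler; safe_sigusr1_handler and failure_seen are made-up names), the handler can be kept async-signal-safe by only setting a flag and using write(), deferring any fprintf-style debug logging to the normal progress path:

#include <signal.h>
#include <string.h>
#include <unistd.h>

/* Flag polled later from the progress loop / main path. */
static volatile sig_atomic_t failure_seen = 0;

static void safe_sigusr1_handler(int sig)
{
    (void)sig;
    failure_seen = 1;                            /* async-signal-safe  */
    const char msg[] = "SIGUSR1 received\n";
    (void)write(STDERR_FILENO, msg, sizeof msg - 1);  /* write() is safe; fprintf() is not */
}

static void install_handler(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_handler = safe_sigusr1_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);
}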
On MacOS the failure only happens with --enable-g=all. On Linux I ran multi_isendalive with --enable-g=all 3000 times and it was OK. Also, if I remove --enable-g=all on MacOS, the original multi_isendalive (with the barrier, not the simplified version) runs 1000 times without failing.
Originally by huiweilu on 2014-11-18 08:48:56 -0600
multi_isendalive fails on the following platforms: