pmodels / casper

Process-based Asynchronous Progress Model for MPI Communication
https://pmodels.github.io/casper-www/

Silent failure in MPI_Init #31

Closed (devreal closed this issue 3 years ago)

devreal commented 5 years ago

I tried running Casper (current git) with MPICH 3.3.1, both by LD_PRELOADing the Casper library and by linking against it directly. Unfortunately, the application aborts immediately, with no hint as to what went wrong:

$ CSP_NG=1 mpirun -n 2 ./a.out
Abort(-1) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 0
Abort(-1) on node 1 (rank 1 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, -1) - process 1

No thread support is requested from MPI.

Any idea how to diagnose this issue? All I get is the following backtrace from gdb, which is not very helpful, because the reported line is the target of a goto and the origin of the jump is not obvious in the code:

(gdb) bt
#0  0x00002aaaaada3b50 in PMPI_Abort ()
   from mpich-3.3.1-intel/lib/libmpi.so.12
#1  0x00002aaaaaafe28d in MPI_Init_thread (argc=0x44000000, argv=0xffffffff, 
    required=0, provided=0x40000202) at ../src/common/init/initthread.c:505
#2  0x00000000004034ab in main ()

hzhou commented 5 years ago

Was this run with Intel MPI? In upstream MPICH, initthread.c is located in src/mpi/init/.

devreal commented 5 years ago

The path of initthread.c is from Casper, not MPI itself. The installation is MPICH 3.3.1 compiled with the Intel compiler (hence mpich-3.3.1-intel/lib/libmpi.so.12 in my naming scheme).

jeffhammond commented 4 years ago

Any progress here? I can't tell whether there is a problem here to debug.

minsii commented 4 years ago

@devreal Sorry for getting back to you so late. Somehow I did not get a notification for this issue. Here are three things you can try:

  1. Set the CSP_VERBOSE environment variable and see whether it reports any errors or warnings, e.g.:
    CSP_VERBOSE=4 mpirun -np 3 ./put
  2. Configure with CFLAGS="-g -O0 -DCSP_DEBUG" and rerun the test. The debug output may be quite dense; you can forward it to me.
  3. If you configure with CFLAGS="-g -O0", you can set a breakpoint and find where it jumps to fn_fail. Line 505 is the failure handling in MPI_Init_thread, which means an earlier init subroutine returned an error.
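
For readers unfamiliar with the pattern behind suggestion 3, here is a minimal sketch of a goto-based failure path and of why a backtrace taken at the failure label does not show which earlier step failed. Everything in it (demo_init, step_a/step_b/step_c, DEMO_CHECK, demo_failed_step) is hypothetical and written for illustration only; it is not Casper's actual code. The sketch assumes each init step returns 0 on success and a nonzero code on error.

```c
#include <stdio.h>

/* Hypothetical init steps; step_b returns nonzero to stand in for a
 * failing init subroutine. None of these names come from Casper. */
static int step_a(void) { return 0; }
static int step_b(void) { return 7; }
static int step_c(void) { return 0; }

/* Text of the call that failed most recently, or NULL if none failed. */
static const char *demo_failed_step = NULL;

/* Record and report the failing call BEFORE jumping, so that the
 * failure label is not the only clue left behind. */
#define DEMO_CHECK(call)                                       \
    do {                                                       \
        err = (call);                                          \
        if (err != 0) {                                        \
            demo_failed_step = #call;                          \
            fprintf(stderr, "%s:%d: %s returned %d\n",         \
                    __FILE__, __LINE__, #call, err);           \
            goto fn_fail;                                      \
        }                                                      \
    } while (0)

int demo_init(void)
{
    int err = 0;

    demo_failed_step = NULL;
    DEMO_CHECK(step_a());
    DEMO_CHECK(step_b());   /* fails, so step_c() is never called */
    DEMO_CHECK(step_c());

  fn_exit:
    return err;

  fn_fail:
    /* A library might abort here. A backtrace taken at this point shows
     * only this line, not the step that triggered the jump. */
    goto fn_exit;
}
```

Calling demo_init() here returns 7 and writes one line to stderr naming step_b(). Without that message, a debugger stopped at fn_fail would only report the label's line number, which matches the situation in the backtrace above; a breakpoint set earlier in the init function and single-stepping is then the way to find the failing call.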

minsii commented 3 years ago

@devreal Are you still facing this issue, or can I close it?

devreal commented 3 years ago

At this point I don't have time to reproduce it, thanks nevertheless. Closing.