Open wirawan0 opened 3 years ago
can you run gstack
or attach GDB to the stuck process to see where it is?
Hi Jeff, sorry for the delay. Here are the stack traces.
The output of top when the test_mpi_accs processes get stuck:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
45007 root 20 0 81252 36852 7116 R 82.6 0.0 4:11.94 test_mpi_accs
45009 root 20 0 81244 36612 6876 R 82.6 0.0 4:11.46 test_mpi_accs
==> Attaching with GDB:
(gdb) attach 45007
Attaching to process 45007
Reading symbols from /opt/armci-mpi-master/tests/mpi/test_mpi_accs...done.
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45022]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
Missing separate debuginfos, use: debuginfo-install glibc-2.17-307.el7.1.x86_64
(gdb) where
#0 0x00007fb9ed0c27f3 in sock_pe_progress_tx_entry () from /opt/mpich/lib/libmpi.so.12
#1 0x00007fb9ed0c3d7b in sock_pe_progress_tx_ctx () from /opt/mpich/lib/libmpi.so.12
#2 0x00007fb9ed0c5056 in sock_pe_progress_ep_tx () from /opt/mpich/lib/libmpi.so.12
#3 0x00007fb9ed0b28b2 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#4 0x00007fb9ed0b2f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#5 0x00007fb9ecba1612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#6 0x00007fb9ecbe0c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#7 0x0000555a27b38356 in main (argc=1, argv=0x7ffc89e639b8) at tests/mpi/test_mpi_accs.c:48
(gdb) attach 45009
Attaching to program: /opt/armci-mpi-master/tests/mpi/test_mpi_accs, process 45009
Reading symbols from /opt/mpich/lib/libmpi.so.12...(no debugging symbols found)...done.
Loaded symbols for /opt/mpich/lib/libmpi.so.12
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[New LWP 45023]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /opt/conda/lib/libgfortran.so.4...done.
Loaded symbols for /opt/conda/lib/libgfortran.so.4
Reading symbols from /lib64/libm.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libm.so.6
Reading symbols from /lib64/libdl.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libdl.so.2
Reading symbols from /lib64/librt.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/librt.so.1
Reading symbols from /opt/conda/lib/libgcc_s.so.1...done.
Loaded symbols for /opt/conda/lib/libgcc_s.so.1
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /opt/conda/lib/libquadmath.so.0...done.
Loaded symbols for /opt/conda/lib/libquadmath.so.0
0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
(gdb) where
#0 0x00007f3404d7baeb in recv () from /lib64/libpthread.so.0
#1 0x00007f3406008f41 in sock_comm_peek () from /opt/mpich/lib/libmpi.so.12
#2 0x00007f3406003e42 in sock_pe_progress_rx_pe_entry () from /opt/mpich/lib/libmpi.so.12
#3 0x00007f3406006b4e in sock_pe_progress_rx_ctx () from /opt/mpich/lib/libmpi.so.12
#4 0x00007f3406006c36 in sock_pe_progress_ep_rx () from /opt/mpich/lib/libmpi.so.12
#5 0x00007f3405ff58f3 in sock_cq_progress () from /opt/mpich/lib/libmpi.so.12
#6 0x00007f3405ff5f21 in sock_cq_sreadfrom () from /opt/mpich/lib/libmpi.so.12
#7 0x00007f3405ae4612 in MPIDI_CH4R_mpi_win_unlock () from /opt/mpich/lib/libmpi.so.12
#8 0x00007f3405b23c23 in PMPI_Win_unlock () from /opt/mpich/lib/libmpi.so.12
#9 0x000055950d909356 in main (argc=1, argv=0x7ffc580e9478) at tests/mpi/test_mpi_accs.c:48
This hang occurs only intermittently; it does not happen on every run.
The MPICH version:
root@wahab-01:/opt/mpich/bin# ./mpichversion
MPICH Version: 3.3.2
MPICH Release date: Tue Nov 12 21:23:16 CST 2019
MPICH Device: ch4:ofi
MPICH configure: --prefix=/opt/mpich --with-device=ch4:ofi
MPICH CC: x86_64-conda_cos6-linux-gnu-gcc -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -O2
MPICH CXX: x86_64-conda_cos6-linux-gnu-g++ -fvisibility-inlines-hidden -std=c++17 -fmessage-length=0 -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -O2
MPICH F77: /opt/conda/bin/x86_64-conda_cos6-linux-gnu-gfortran -fopenmp -march=nocona -mtune=haswell -ftree-vectorize -fPIC -fstack-protector-strong -fno-plt -O2 -ffunction-sections -pipe -isystem /opt/conda/include -O2
MPICH FC: x86_64-conda_cos6-linux-gnu-gfortran -O2
MPICH Custom Information:
Line 48 referenced above is the very first call to MPI_Win_unlock
in the code (the first loop iteration).
It seems that it is stuck in the libfabric sockets conduit. I have never seen this before.
Try running with MPICH_ASYNC_PROGRESS=1
(no need to recompile anything). If that works, we have a clue.
If it does, can you then rebuild ARMCI-MPI with configure --with-progress ..
and run with ARMCI_PROGRESS_THREAD=1
and ARMCI_PROGRESS_USLEEP=100
? That would be a second clue.
If neither works, I have to assume the container situation has broken something. I do not use containers so I don't know how to debug this.
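Concretely, the two experiments above amount to environment settings like these (the variable names are taken from the suggestion above; the final echo just prints the values as a sanity check, and the mpiexec invocation is only indicated in comments):

```shell
# Experiment 1 (no rebuild needed): enable MPICH's internal asynchronous
# progress thread before launching, e.g. with `mpiexec -n 2 ./test_mpi_accs`.
export MPICH_ASYNC_PROGRESS=1

# Experiment 2 (after rebuilding ARMCI-MPI with `configure --with-progress ..`):
# enable ARMCI-MPI's own progress thread, polling every 100 microseconds.
export ARMCI_PROGRESS_THREAD=1
export ARMCI_PROGRESS_USLEEP=100

echo "$MPICH_ASYNC_PROGRESS $ARMCI_PROGRESS_THREAD $ARMCI_PROGRESS_USLEEP"
```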
With MPICH_ASYNC_PROGRESS=1
the result is erratic: it usually works, but roughly 25% of the runs still get stuck. Here is a test script that launches the same program again and again:
#!/bin/bash
# second test: with MPICH_ASYNC_PROGRESS=1
source /etc/container_runtime/runtime
load_mpich
: ${COUNT:=100}
export MPICH_ASYNC_PROGRESS=1
for c in $(seq 1 $COUNT); do
    echo "Iteration $c - $(date +%Y-%m-%dT%H:%M:%S)"
    tm_start=$(date +%s.%N)
    mpiexec -n 2 ./test_mpi_accs
    tm_end=$(date +%s.%N)
    awk -v tm_start="$tm_start" -v tm_end="$tm_end" -v c="$c" \
        'BEGIN { printf("- iteration %3d time = %.3f secs\n", c, tm_end - tm_start) }'
    sleep 5s
done
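One possible tweak to the script (an assumption on my part, not part of the original): wrapping the mpiexec line in coreutils `timeout` would kill a hung iteration automatically instead of requiring a manual Ctrl-C. `timeout` exits with status 124 when it kills the command, which the loop could log. A standalone demo of that exit-code convention, with `sleep 5` standing in for a hung run:

```shell
# coreutils `timeout` terminates the command after the time limit and
# exits with status 124; `sleep 5` stands in for a hung mpiexec here.
timeout 1 sleep 5
echo "exit status: $?"   # prints "exit status: 124"
```

In the real loop one would pick a limit well above the normal runtime, e.g. `timeout 60 mpiexec -n 2 ./test_mpi_accs`.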
Here is example output (filtered down to the relevant lines); you can see three stuck iterations here:
- iteration 1 time = 1.089 secs
- iteration 2 time = 2.035 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration 3 time = 15.499 secs
- iteration 4 time = 2.090 secs
- iteration 5 time = 1.494 secs
- iteration 6 time = 2.926 secs
- iteration 7 time = 3.569 secs
- iteration 8 time = 1.292 secs
- iteration 9 time = 1.506 secs
- iteration 10 time = 1.785 secs
- iteration 11 time = 2.163 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration 12 time = 215.610 secs
- iteration 13 time = 2.391 secs
^C[mpiexec@wahab-01] Sending Ctrl-C to processes as requested
[mpiexec@wahab-01] Press Ctrl-C again to force abort
- iteration 14 time = 24.370 secs
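Since the hung iterations show up only as timing outliers, it may be handy to flag them automatically when scanning a long log. A small sketch, assuming the "- iteration N time = T secs" format above (sample log lines are fed inline here for illustration; the 10-second threshold is arbitrary but well above the normal 1-3 s runtime):

```shell
# Print the iteration numbers whose wall time exceeds 10 seconds.
# Fields: $2 = "iteration", $3 = iteration number, $6 = elapsed seconds.
printf '%s\n' \
  '- iteration   2 time = 2.035 secs' \
  '- iteration   3 time = 15.499 secs' \
  '- iteration  12 time = 215.610 secs' |
awk '$2 == "iteration" && $6 + 0 > 10 { print "hung: iteration " $3 }'
```

For the sample lines this prints "hung: iteration 3" and "hung: iteration 12".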
On a related note, when I tried to build armci-mpi outside the container I ran into a different problem, which I posted as a separate issue: #32.
While building the armci-mpi library for use on our cluster, I found that the test_mpi_accs program could not make progress. I don't yet have good information about where the program gets stuck. The underlying MPI library is MPICH 3.1. The build took place in a Singularity container. The compiler is GCC 7.3.0 (crosstool-NG 1.23.0.449-a04d0) provided by conda, and the MPICH library was built with that same GCC toolchain.