Closed by abouteiller 5 years ago.
Original comment by Nathan Weeks (Bitbucket: [Nathan Weeks](https://bitbucket.org/Nathan Weeks)).
My bad; the commit SHA-1 I referenced was the most recent, but I had built with a previous version. The aforementioned ULFM2 version actually results in a hang at runtime. I've attached a Dockerfile for reproducibility:
$ docker build -t ulfm2:2.1-centos7.6 .
$ docker run -it --rm -w /mnt -v $PWD:/mnt ulfm2:2.1-centos7.6 mpicc -g 01.err_returns-thread_multiple.c
$ docker run --privileged -it --rm -w /mnt -v $PWD:/mnt ulfm2:2.1-centos7.6 mpiexec --allow-run-as-root --oversubscribe --get-stack-traces --timeout 5 -n 3 ./a.out
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:
Timeout: 5 seconds
The job will now be aborted. Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
Waiting for stack traces (this may take a few moments)...
STACK TRACE FOR PROC [[8954,1],0] (fc2f05ee1e64, PID 10)
Thread 3 (Thread 0x7f58ed2bc700 (LWP 11)):
#0 0x00007f58ef15c20d in poll () from /lib64/libc.so.6
#1 0x00007f58eeaa2fb6 in poll_dispatch (base=0x172f9c0, tv=0x7f58ed2bbe80) at poll.c:165
#2 0x00007f58eea9ac70 in opal_libevent2022_event_base_loop (base=0x172f9c0, flags=1) at event.c:1630
#3 0x00007f58eea3792b in progress_engine (obj=0x172f838) at runtime/opal_progress_threads.c:105
#4 0x00007f58ef43ddd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f58ef166ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f58e657a700 (LWP 13)):
#0 0x00007f58ef167483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f58eea97223 in epoll_dispatch (base=0x178b630, tv=<optimized out>) at epoll.c:407
#2 0x00007f58eea9ac70 in opal_libevent2022_event_base_loop (base=0x178b630, flags=1) at event.c:1630
#3 0x00007f58ec60fb3f in progress_engine (obj=0x178b5b8) at runtime/pmix_progress_threads.c:109
#4 0x00007f58ef43ddd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f58ef166ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f58efc4b740 (LWP 10)):
#0 0x00007f58ef14bd47 in sched_yield () from /lib64/libc.so.6
#1 0x00007f58eea2fafe in opal_progress () at runtime/opal_progress.c:256
#2 0x00007f58eea38dd3 in ompi_sync_wait_mt (sync=0x7ffdfbc6ff60) at threads/wait_sync.c:117
#3 0x00007f58ef6b5f90 in ompi_request_wait_completion (req=0x1807fc0) at ../ompi/request/request.h:445
#4 0x00007f58ef6b60cb in ompi_request_default_wait (req_ptr=0x7ffdfbc70060, status=0x7ffdfbc70040) at request/req_wait.c:42
#5 0x00007f58ef769f88 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=2, rtag=-16, comm=0x601480 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:64
#6 0x00007f58ef76a591 in ompi_coll_base_barrier_intra_bruck (comm=0x601480 <ompi_mpi_comm_world>, module=0x180c9a0) at base/coll_base_barrier.c:271
#7 0x00007f58df161624 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601480 <ompi_mpi_comm_world>, module=0x180c9a0) at coll_tuned_decision_fixed.c:211
#8 0x00007f58ef6d9b41 in PMPI_Barrier (comm=0x601480 <ompi_mpi_comm_world>) at pbarrier.c:78
#9 0x0000000000400ad4 in main (argc=1, argv=0x7ffdfbc70338) at 01.err_returns-thread_multiple.c:37
STACK TRACE FOR PROC [[8954,1],1] (fc2f05ee1e64, PID 12)
Thread 3 (Thread 0x7f2f29cde700 (LWP 15)):
#0 0x00007f2f2bb7e20d in poll () from /lib64/libc.so.6
#1 0x00007f2f2b4c4fb6 in poll_dispatch (base=0x8839c0, tv=0x7f2f29cdde80) at poll.c:165
#2 0x00007f2f2b4bcc70 in opal_libevent2022_event_base_loop (base=0x8839c0, flags=1) at event.c:1630
#3 0x00007f2f2b45992b in progress_engine (obj=0x883838) at runtime/opal_progress_threads.c:105
#4 0x00007f2f2be5fdd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f2f2bb88ead in clone () from /lib64/libc.so.6
Thread 2 (Thread 0x7f2f270b9700 (LWP 16)):
#0 0x00007f2f2bb89483 in epoll_wait () from /lib64/libc.so.6
#1 0x00007f2f2b4b9223 in epoll_dispatch (base=0x8df630, tv=<optimized out>) at epoll.c:407
#2 0x00007f2f2b4bcc70 in opal_libevent2022_event_base_loop (base=0x8df630, flags=1) at event.c:1630
#3 0x00007f2f29031b3f in progress_engine (obj=0x8df5b8) at runtime/pmix_progress_threads.c:109
#4 0x00007f2f2be5fdd5 in start_thread () from /lib64/libpthread.so.0
#5 0x00007f2f2bb88ead in clone () from /lib64/libc.so.6
Thread 1 (Thread 0x7f2f2c66d740 (LWP 12)):
#0 0x00007f2f2bb6dd47 in sched_yield () from /lib64/libc.so.6
#1 0x00007f2f2b451afe in opal_progress () at runtime/opal_progress.c:256
#2 0x00007f2f2b45add3 in ompi_sync_wait_mt (sync=0x7ffd300de4c0) at threads/wait_sync.c:117
#3 0x00007f2f2c0d7f90 in ompi_request_wait_completion (req=0x95bfc0) at ../ompi/request/request.h:445
#4 0x00007f2f2c0d80cb in ompi_request_default_wait (req_ptr=0x7ffd300de5c0, status=0x7ffd300de5a0) at request/req_wait.c:42
#5 0x00007f2f2c18bf88 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, source=2, rtag=-16, comm=0x601480 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:64
#6 0x00007f2f2c18c591 in ompi_coll_base_barrier_intra_bruck (comm=0x601480 <ompi_mpi_comm_world>, module=0x9609c0) at base/coll_base_barrier.c:271
#7 0x00007f2f1fbd3624 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601480 <ompi_mpi_comm_world>, module=0x9609c0) at coll_tuned_decision_fixed.c:211
#8 0x00007f2f2c0fbb41 in PMPI_Barrier (comm=0x601480 <ompi_mpi_comm_world>) at pbarrier.c:78
#9 0x0000000000400ad4 in main (argc=1, argv=0x7ffd300de898) at 01.err_returns-thread_multiple.c:37
Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Looking into it.
Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Can reproduce.
Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Fixed the later report; now dealing with the earlier one.
Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).
Fixed in commits 67ae9392 and 804bb693.
Original report by Nathan Weeks (Bitbucket: [Nathan Weeks](https://bitbucket.org/Nathan Weeks)).
When ULFM2 (commit 04b0a92b540b2163b37f840bc3f35b2992567de4) is configured with --enable-debug, example 01.err_returns.c from the ulfm-testing repo (modified to initialize MPI with MPI_THREAD_MULTIPLE; see the attached 01.err_returns-thread_multiple.c) deadlocks in MPI_Barrier(). This has been observed on both CentOS 7.6.1810 (the above example is from a Docker container; I could provide the Dockerfile on request) and macOS 10.13.6.
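For reference, here is a minimal sketch of the kind of modification described above. It is not the attached 01.err_returns-thread_multiple.c itself; it only shows the pattern the report describes: request MPI_THREAD_MULTIPLE at initialization, set MPI_ERRORS_RETURN on MPI_COMM_WORLD, then call MPI_Barrier(), which is where the hang is observed. The real 01.err_returns test additionally injects a process failure and checks the returned error class; that logic is omitted here.

/* Minimal sketch (not the attached test): MPI_THREAD_MULTIPLE init,
 * MPI_ERRORS_RETURN on MPI_COMM_WORLD, then a barrier. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int provided, rank, rc;

    /* Request full thread support, as in the modified test. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    if (provided < MPI_THREAD_MULTIPLE) {
        fprintf(stderr, "MPI_THREAD_MULTIPLE not provided (got %d)\n", provided);
    }

    /* Have errors returned instead of aborting, so ULFM error codes reach the caller. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* The call that hangs in the reported --enable-debug configuration. */
    rc = MPI_Barrier(MPI_COMM_WORLD);
    printf("rank %d: MPI_Barrier returned %d\n", rank, rc);

    MPI_Finalize();
    return 0;
}

Compiled with mpicc -g and launched with mpiexec -n 3 as in the commands above, this reaches the same PMPI_Barrier call shown at frames #8-#9 of the stack traces.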