ulfm-devel / ompi

Open MPI main development repository
https://www.open-mpi.org

MPI_Barrier with MPI_THREAD_MULTIPLE causes assertion in ompi_request_wait_completion #41

Closed: abouteiller closed this issue 5 years ago

abouteiller commented 5 years ago

Original report by Nathan Weeks (Bitbucket: Nathan Weeks).


When ULFM2 (commit 04b0a92b540b2163b37f840bc3f35b2992567de4) is configured with --enable-debug, example 01.err_returns.c from the ulfm-testing repo (modified to initialize MPI with MPI_THREAD_MULTIPLE; see the attached 01.err_returns-thread_multiple.c) aborts with an assertion failure in MPI_Barrier():

# mpicc -g 01.err_returns-thread_multiple.c
# mpiexec --allow-run-as-root --oversubscribe -n 3 ./a.out
a.out: ../ompi/request/request.h:459: ompi_request_wait_completion: Assertion `((void*)1L == (req)->req_complete)' failed.
[cde2e163dc0a:00077] *** Process received signal ***
[cde2e163dc0a:00077] Signal: Aborted (6)
[cde2e163dc0a:00077] Signal code:  (-6)
a.out: ../ompi/request/request.h:459: ompi_request_wait_completion: Assertion `((void*)1L == (req)->req_complete)' failed.
[cde2e163dc0a:00075] *** Process received signal ***
[cde2e163dc0a:00077] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7f1bbf1155d0]
[cde2e163dc0a:00077] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7f1bbed6f207]
[cde2e163dc0a:00077] [ 2] /lib64/libc.so.6(abort+0x148)[0x7f1bbed708f8]
[cde2e163dc0a:00077] [ 3] /lib64/libc.so.6(+0x2f026)[0x7f1bbed68026]
[cde2e163dc0a:00077] [ 4] /lib64/libc.so.6(+0x2f0d2)[0x7f1bbed680d2]
[cde2e163dc0a:00077] [ 5] /usr/local/lib/libmpi.so.0(+0x6409a)[0x7f1bbf38609a]
[cde2e163dc0a:00077] [ 6] /usr/local/lib/libmpi.so.0(ompi_request_default_wait+0x27)[0x7f1bbf386146]
[cde2e163dc0a:00077] [ 7] /usr/local/lib/libmpi.so.0(+0x117e8e)[0x7f1bbf439e8e]
[cde2e163dc0a:00077] [ 8] /usr/local/lib/libmpi.so.0(ompi_coll_base_barrier_intra_bruck+0xb0)[0x7f1bbf43a497]
[cde2e163dc0a:00077] [ 9] /usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x82)[0x7f1bb2f48624]
[cde2e163dc0a:00077] [10] /usr/local/lib/libmpi.so.0(MPI_Barrier+0x184)[0x7f1bbf3a9a3e]
[cde2e163dc0a:00077] [11] ./a.out[0x400ad4]
[cde2e163dc0a:00077] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7f1bbed5b3d5]
[cde2e163dc0a:00077] [13] ./a.out[0x400949]
[cde2e163dc0a:00077] *** End of error message ***
[cde2e163dc0a:00075] Signal: Aborted (6)
[cde2e163dc0a:00075] Signal code:  (-6)
[cde2e163dc0a:00075] [ 0] /lib64/libpthread.so.0(+0xf5d0)[0x7fd52b7e15d0]
[cde2e163dc0a:00075] [ 1] /lib64/libc.so.6(gsignal+0x37)[0x7fd52b43b207]
[cde2e163dc0a:00075] [ 2] /lib64/libc.so.6(abort+0x148)[0x7fd52b43c8f8]
[cde2e163dc0a:00075] [ 3] /lib64/libc.so.6(+0x2f026)[0x7fd52b434026]
[cde2e163dc0a:00075] [ 4] /lib64/libc.so.6(+0x2f0d2)[0x7fd52b4340d2]
[cde2e163dc0a:00075] [ 5] /usr/local/lib/libmpi.so.0(+0x6409a)[0x7fd52ba5209a]
[cde2e163dc0a:00075] [ 6] /usr/local/lib/libmpi.so.0(ompi_request_default_wait+0x27)[0x7fd52ba52146]
[cde2e163dc0a:00075] [ 7] /usr/local/lib/libmpi.so.0(+0x117e8e)[0x7fd52bb05e8e]
[cde2e163dc0a:00075] [ 8] /usr/local/lib/libmpi.so.0(ompi_coll_base_barrier_intra_bruck+0xb0)[0x7fd52bb06497]
[cde2e163dc0a:00075] [ 9] /usr/local/lib/openmpi/mca_coll_tuned.so(ompi_coll_tuned_barrier_intra_dec_fixed+0x82)[0x7fd51b599624]
[cde2e163dc0a:00075] [10] /usr/local/lib/libmpi.so.0(MPI_Barrier+0x184)[0x7fd52ba75a3e]
[cde2e163dc0a:00075] [11] ./a.out[0x400ad4]
[cde2e163dc0a:00075] [12] /lib64/libc.so.6(__libc_start_main+0xf5)[0x7fd52b4273d5]
[cde2e163dc0a:00075] [13] ./a.out[0x400949]
[cde2e163dc0a:00075] *** End of error message ***
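
For reference, a minimal sketch of the modified test, assuming it keeps the structure of 01.err_returns.c from ulfm-testing (the attached 01.err_returns-thread_multiple.c is authoritative; the victim-selection logic here is illustrative):

#include <mpi.h>
#include <signal.h>
#include <stdio.h>

int main(int argc, char *argv[])
{
    int rank, size, provided, rc;

    /* The modification vs. 01.err_returns.c: request MPI_THREAD_MULTIPLE
     * instead of calling plain MPI_Init. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_MULTIPLE, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Have errors returned to the caller rather than aborting the job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    /* One process fails; the survivors should get an error back from the
     * barrier (which rank is killed here is illustrative). */
    if (rank == size - 1)
        raise(SIGKILL);

    rc = MPI_Barrier(MPI_COMM_WORLD);   /* asserts / hangs as reported */
    printf("Rank %d: MPI_Barrier returned %d\n", rank, rc);

    MPI_Finalize();
    return 0;
}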

This has been observed on both CentOS 7.6.1810 (the example above is from a Docker container; I could provide the Dockerfile on request) and macOS 10.13.6.
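
For readers unfamiliar with the failing check: in Open MPI a request's req_complete field is a tri-state pointer, and (void*)1L is the "request completed" sentinel (the other states are "pending" and "a waiter is parked here"). Below is a self-contained model of that protocol as I understand it from ompi/request/request.h; it is illustrative, not Open MPI source. The REQUEST_PENDING/REQUEST_COMPLETED names match Open MPI's, while wait_sync_t is a stand-in for the real ompi_wait_sync_t used by ompi_sync_wait_mt:

#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

#define REQUEST_PENDING   ((void *)0L)
#define REQUEST_COMPLETED ((void *)1L)  /* the sentinel the assertion expects */

typedef struct {
    pthread_mutex_t lock;
    pthread_cond_t  cond;
    int             signalled;
} wait_sync_t;

typedef struct {
    _Atomic(void *) req_complete;  /* PENDING, COMPLETED, or a parked waiter */
} request_t;

/* Completion side: CAS PENDING -> COMPLETED; if a waiter already parked its
 * sync object in req_complete, publish completion and wake it. */
static void request_complete(request_t *req)
{
    void *expected = REQUEST_PENDING;
    if (!atomic_compare_exchange_strong(&req->req_complete, &expected,
                                        REQUEST_COMPLETED)) {
        wait_sync_t *sync = expected;              /* a waiter is parked */
        atomic_store(&req->req_complete, REQUEST_COMPLETED);
        pthread_mutex_lock(&sync->lock);
        sync->signalled = 1;
        pthread_cond_signal(&sync->cond);
        pthread_mutex_unlock(&sync->lock);
    }
}

/* Waiting side: park a sync object, sleep until signalled, then assert the
 * sentinel -- the analogue of the check at request.h:459. */
static void request_wait(request_t *req)
{
    wait_sync_t sync = { PTHREAD_MUTEX_INITIALIZER,
                         PTHREAD_COND_INITIALIZER, 0 };
    void *expected = REQUEST_PENDING;
    if (atomic_compare_exchange_strong(&req->req_complete, &expected, &sync)) {
        pthread_mutex_lock(&sync.lock);
        while (!sync.signalled)
            pthread_cond_wait(&sync.cond, &sync.lock);
        pthread_mutex_unlock(&sync.lock);
    }
    assert(REQUEST_COMPLETED == atomic_load(&req->req_complete));
}

static void *completer(void *arg) { request_complete(arg); return NULL; }

int main(void)
{
    request_t req = { REQUEST_PENDING };
    pthread_t t;
    pthread_create(&t, NULL, completer, &req);
    request_wait(&req);   /* would abort here if completion were lost */
    pthread_join(t, NULL);
    puts("request completed");
    return 0;
}

In this model the assertion can only fire if a waiter wakes up (or skips sleeping) while req_complete is not the completed sentinel, i.e. if a completion is lost or a wake-up races with publication of the completed state, which is consistent with the failure appearing only on the MPI_THREAD_MULTIPLE path.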

abouteiller commented 5 years ago

Original comment by Nathan Weeks (Bitbucket: Nathan Weeks).


My bad: the commit SHA-1 I referenced was the most recent, but I had built with an earlier version. The ULFM2 commit referenced above actually results in a hang at runtime rather than the assertion failure shown above. I've attached a Dockerfile for reproducibility:

$ docker build -t ulfm2:2.1-centos7.6 .
$ docker run -it --rm -w /mnt -v $PWD:/mnt ulfm2:2.1-centos7.6 mpicc -g 01.err_returns-thread_multiple.c
$ docker run --privileged -it --rm -w /mnt -v $PWD:/mnt ulfm2:2.1-centos7.6 mpiexec --allow-run-as-root --oversubscribe --get-stack-traces --timeout 5 -n 3 ./a.out
--------------------------------------------------------------------------
The user-provided time limit for job execution has been reached:

  Timeout: 5 seconds

The job will now be aborted.  Please check your code and/or
adjust/remove the job execution time limit (as specified by --timeout
command line option or MPIEXEC_TIMEOUT environment variable).
--------------------------------------------------------------------------
Waiting for stack traces (this may take a few moments)...
STACK TRACE FOR PROC [[8954,1],0] (fc2f05ee1e64, PID 10)
    Thread 3 (Thread 0x7f58ed2bc700 (LWP 11)):
    #0  0x00007f58ef15c20d in poll () from /lib64/libc.so.6
    #1  0x00007f58eeaa2fb6 in poll_dispatch (base=0x172f9c0, tv=0x7f58ed2bbe80) at poll.c:165
    #2  0x00007f58eea9ac70 in opal_libevent2022_event_base_loop (base=0x172f9c0, flags=1) at event.c:1630
    #3  0x00007f58eea3792b in progress_engine (obj=0x172f838) at runtime/opal_progress_threads.c:105
    #4  0x00007f58ef43ddd5 in start_thread () from /lib64/libpthread.so.0
    #5  0x00007f58ef166ead in clone () from /lib64/libc.so.6
    Thread 2 (Thread 0x7f58e657a700 (LWP 13)):
    #0  0x00007f58ef167483 in epoll_wait () from /lib64/libc.so.6
    #1  0x00007f58eea97223 in epoll_dispatch (base=0x178b630, tv=<optimized out>) at epoll.c:407
    #2  0x00007f58eea9ac70 in opal_libevent2022_event_base_loop (base=0x178b630, flags=1) at event.c:1630
    #3  0x00007f58ec60fb3f in progress_engine (obj=0x178b5b8) at runtime/pmix_progress_threads.c:109
    #4  0x00007f58ef43ddd5 in start_thread () from /lib64/libpthread.so.0
    #5  0x00007f58ef166ead in clone () from /lib64/libc.so.6
    Thread 1 (Thread 0x7f58efc4b740 (LWP 10)):
    #0  0x00007f58ef14bd47 in sched_yield () from /lib64/libc.so.6
    #1  0x00007f58eea2fafe in opal_progress () at runtime/opal_progress.c:256
    #2  0x00007f58eea38dd3 in ompi_sync_wait_mt (sync=0x7ffdfbc6ff60) at threads/wait_sync.c:117
    #3  0x00007f58ef6b5f90 in ompi_request_wait_completion (req=0x1807fc0) at ../ompi/request/request.h:445
    #4  0x00007f58ef6b60cb in ompi_request_default_wait (req_ptr=0x7ffdfbc70060, status=0x7ffdfbc70040) at request/req_wait.c:42
    #5  0x00007f58ef769f88 in ompi_coll_base_sendrecv_zero (dest=1, stag=-16, source=2, rtag=-16, comm=0x601480 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:64
    #6  0x00007f58ef76a591 in ompi_coll_base_barrier_intra_bruck (comm=0x601480 <ompi_mpi_comm_world>, module=0x180c9a0) at base/coll_base_barrier.c:271
    #7  0x00007f58df161624 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601480 <ompi_mpi_comm_world>, module=0x180c9a0) at coll_tuned_decision_fixed.c:211
    #8  0x00007f58ef6d9b41 in PMPI_Barrier (comm=0x601480 <ompi_mpi_comm_world>) at pbarrier.c:78
    #9  0x0000000000400ad4 in main (argc=1, argv=0x7ffdfbc70338) at 01.err_returns-thread_multiple.c:37

STACK TRACE FOR PROC [[8954,1],1] (fc2f05ee1e64, PID 12)
    Thread 3 (Thread 0x7f2f29cde700 (LWP 15)):
    #0  0x00007f2f2bb7e20d in poll () from /lib64/libc.so.6
    #1  0x00007f2f2b4c4fb6 in poll_dispatch (base=0x8839c0, tv=0x7f2f29cdde80) at poll.c:165
    #2  0x00007f2f2b4bcc70 in opal_libevent2022_event_base_loop (base=0x8839c0, flags=1) at event.c:1630
    #3  0x00007f2f2b45992b in progress_engine (obj=0x883838) at runtime/opal_progress_threads.c:105
    #4  0x00007f2f2be5fdd5 in start_thread () from /lib64/libpthread.so.0
    #5  0x00007f2f2bb88ead in clone () from /lib64/libc.so.6
    Thread 2 (Thread 0x7f2f270b9700 (LWP 16)):
    #0  0x00007f2f2bb89483 in epoll_wait () from /lib64/libc.so.6
    #1  0x00007f2f2b4b9223 in epoll_dispatch (base=0x8df630, tv=<optimized out>) at epoll.c:407
    #2  0x00007f2f2b4bcc70 in opal_libevent2022_event_base_loop (base=0x8df630, flags=1) at event.c:1630
    #3  0x00007f2f29031b3f in progress_engine (obj=0x8df5b8) at runtime/pmix_progress_threads.c:109
    #4  0x00007f2f2be5fdd5 in start_thread () from /lib64/libpthread.so.0
    #5  0x00007f2f2bb88ead in clone () from /lib64/libc.so.6
    Thread 1 (Thread 0x7f2f2c66d740 (LWP 12)):
    #0  0x00007f2f2bb6dd47 in sched_yield () from /lib64/libc.so.6
    #1  0x00007f2f2b451afe in opal_progress () at runtime/opal_progress.c:256
    #2  0x00007f2f2b45add3 in ompi_sync_wait_mt (sync=0x7ffd300de4c0) at threads/wait_sync.c:117
    #3  0x00007f2f2c0d7f90 in ompi_request_wait_completion (req=0x95bfc0) at ../ompi/request/request.h:445
    #4  0x00007f2f2c0d80cb in ompi_request_default_wait (req_ptr=0x7ffd300de5c0, status=0x7ffd300de5a0) at request/req_wait.c:42
    #5  0x00007f2f2c18bf88 in ompi_coll_base_sendrecv_zero (dest=0, stag=-16, source=2, rtag=-16, comm=0x601480 <ompi_mpi_comm_world>) at base/coll_base_barrier.c:64
    #6  0x00007f2f2c18c591 in ompi_coll_base_barrier_intra_bruck (comm=0x601480 <ompi_mpi_comm_world>, module=0x9609c0) at base/coll_base_barrier.c:271
    #7  0x00007f2f1fbd3624 in ompi_coll_tuned_barrier_intra_dec_fixed (comm=0x601480 <ompi_mpi_comm_world>, module=0x9609c0) at coll_tuned_decision_fixed.c:211
    #8  0x00007f2f2c0fbb41 in PMPI_Barrier (comm=0x601480 <ompi_mpi_comm_world>) at pbarrier.c:78
    #9  0x0000000000400ad4 in main (argc=1, argv=0x7ffd300de898) at 01.err_returns-thread_multiple.c:37
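
Both surviving ranks are stuck in the same place: Thread 1 of each is in ompi_coll_base_sendrecv_zero inside the Bruck barrier, blocked in ompi_sync_wait_mt (the MPI_THREAD_MULTIPLE wait path) on a receive from rank 2 that can never complete. As a rough, hypothetical illustration of what that step does (Open MPI's ompi_coll_base_sendrecv_zero is structured differently, waiting on each request via ompi_request_default_wait):

#include <mpi.h>

/* Hypothetical sketch of one zero-byte exchange step of a Bruck-style
 * barrier; illustrative only, not Open MPI's implementation. */
static int sendrecv_zero(int dest, int stag, int source, int rtag,
                         MPI_Comm comm)
{
    MPI_Request reqs[2];
    MPI_Irecv(NULL, 0, MPI_BYTE, source, rtag, comm, &reqs[0]);
    MPI_Isend(NULL, 0, MPI_BYTE, dest, stag, comm, &reqs[1]);
    /* If `source` is a failed process, this wait never returns unless the
     * failure is detected and the pending request is completed in error --
     * the step that is evidently not happening on this code path. */
    return MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
}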

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Looking into it.

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Can reproduce.

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Fixed the later report (the hang); now dealing with the earlier one (the assertion failure).

abouteiller commented 5 years ago

Original comment by Aurelien Bouteiller (Bitbucket: abouteiller, GitHub: abouteiller).


Fixed in commits 67ae9392 and 804bb693.