open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

Hangs in mca_btl_vader_component_progress on multiple archs #5638

Closed amckinstry closed 6 years ago

amckinstry commented 6 years ago

Background information

Open MPI 3.1.2, PMIx 3.0.1

Installed on Debian sid.

Testing across our suite of MPI programs, we're seeing hangs in some apps; offhand, the common factor looks like 32-bit systems: i386, mipsel.

Debian bugs: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905418 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907267 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907407

This is on simple 2-core systems for the most part. I have a straightforward reproducible case in a VM here on i386 with ARPACK. The backtraces look like

#0  0xb7f43d09 in __kernel_vsyscall ()
#1  0xb7f43986 in __vdso_clock_gettime ()
#2  0xb7111af1 in __GI___clock_gettime (clock_id=1, tp=0xbffc7f54) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6e9edd7 in ?? () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#4  0xb6e4d3f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb76474a5 in ompi_request_default_wait () from /usr/lib/i386-linux-gnu/libmpi.so.40
#6  0xb769d0a5 in ompi_coll_base_sendrecv_actual () from /usr/lib/i386-linux-gnu/libmpi.so.40
#7  0xb769d4a9 in ompi_coll_base_allreduce_intra_recursivedoubling () from /usr/lib/i386-linux-gnu/libmpi.so.40
#8  0xb4a39860 in ompi_coll_tuned_allreduce_intra_dec_fixed () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_coll_tuned.so
#9  0xb765cf9b in PMPI_Allreduce () from /usr/lib/i386-linux-gnu/libmpi.so.40
#10 0xb774de78 in pmpi_allreduce__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
#11 0xb7f0acc3 in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:79
#12 0xb7f0614d in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:776
#13 0xb7f0749d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391
#14 0xb7f0813e in pdsaupd (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, tol=1.1102230246251565e-16, resid=..., ncv=20, v=..., ldv=256, iparam=...,
    ipntr=..., workd=..., workl=..., lworkl=560, info=0, _bmat=1, _which=2) at pdsaupd.f:630
#15 0x004f6977 in parnoldi (comm=0) at issue46.f:200
#16 0x004f758f in issue46 () at issue46.f:21
#17 0x004f61ec in main (argc=1, argv=0xbffce86f) at issue46.f:23
#18 0xb70259a1 in __libc_start_main (main=0x4f61b0 <main>, argc=1, argv=0xbffcd7d4, init=0x4f75d0 <__libc_csu_init>, fini=0x4f7630 <__libc_csu_fini>,
    rtld_fini=0xb7f54f60 <_dl_fini>, stack_end=0xbffcd7cc) at ../csu/libc-start.c:310
#19 0x004f622c in _start ()

and


#0  0xb7fced09 in __kernel_vsyscall ()
#1  0xb7fce986 in __vdso_clock_gettime ()
#2  0xb719caf1 in __GI___clock_gettime (clock_id=1, tp=0xbf82ef84) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f29dd7 in ?? () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#4  0xb6ed83f8 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#5  0xb4ae5b1d in mca_pml_ob1_recv () from /usr/lib/i386-linux-gnu/openmpi/lib/openmpi3/mca_pml_ob1.so
#6  0xb77095a8 in PMPI_Recv () from /usr/lib/i386-linux-gnu/libmpi.so.40
#7  0xb77e20a5 in pmpi_recv__ () from /usr/lib/i386-linux-gnu/libmpi_mpifh.so.40
#8  0x0048e6a1 in av (comm=0, nloc=50, nx=10, mv_buf=..., v=..., w=...) at issue46.f:436
#9  0x0048e92a in parnoldi (comm=0) at issue46.f:215
#10 0x0048f58f in issue46 () at issue46.f:21
#11 0x0048e1ec in main (argc=1, argv=0xbf83586f) at issue46.f:23
#12 0xb70b09a1 in __libc_start_main (main=0x48e1b0 <main>, argc=1, argv=0xbf834264, init=0x48f5d0 <__libc_csu_init>, fini=0x48f630 <__libc_csu_fini>,
    rtld_fini=0xb7fdff60 <_dl_fini>, stack_end=0xbf83425c) at ../csu/libc-start.c:310
#13 0x0048e22c in _start ()
amckinstry commented 6 years ago

With a debug build:

#0  mca_btl_vader_check_fboxes () at btl_vader_fbox.h:170
#1  mca_btl_vader_component_progress () at btl_vader_component.c:702
#2  0xb6df9380 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#3  0xb75f34a5 in ompi_request_wait_completion (req=0x13f4d00) at ../ompi/request/request.h:413
#4  ompi_request_default_wait (req_ptr=0xbf7fdda4, status=0xbf7fdda8) at request/req_wait.c:42
#5  0xb76490a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x13da6c0, scount=1, sdatatype=0xb76abf00 <ompi_mpi_dblprec>, dest=0, stag=-12, recvbuf=0xbf7fdfe0,
    rcount=1, rdatatype=0xb76abf00 <ompi_mpi_dblprec>, source=0, rtag=-12, comm=0xb76b3fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#6  0xb76494a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf7fdfd0, rbuf=0xbf7fdfe0, count=1, dtype=0xb76abf00 <ompi_mpi_dblprec>,
    op=0xb76c1a20 <ompi_mpi_op_sum>, comm=0xb76b3fe0 <ompi_mpi_comm_world>, module=0x13f62f0) at base/coll_base_allreduce.c:217
#7  0xb49e2860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf7fdfd0, rbuf=0xbf7fdfe0, count=1, dtype=0xb76abf00 <ompi_mpi_dblprec>,
    op=0xb76c1a20 <ompi_mpi_op_sum>, comm=0xb76b3fe0 <ompi_mpi_comm_world>, module=0x13f62f0) at coll_tuned_decision_fixed.c:66
#8  0xb7608f9b in PMPI_Allreduce (sendbuf=0xbf7fdfd0, recvbuf=0xbf7fdfe0, count=1, datatype=0xb76abf00 <ompi_mpi_dblprec>, op=0xb76c1a20 <ompi_mpi_op_sum>,
    comm=0xb76b3fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#9  0xb76f9e78 in ompi_allreduce_f (sendbuf=0xbf7fdfd0 "\030!甯A\353?\001먠\335\036\364?FLl\366̑\362?\220\347", <incomplete sequence \355\267>,
    recvbuf=0xbf7fdfe0 "FLl\366̑\362?\220\347", <incomplete sequence \355\267>, count=0xb7eca1c0, datatype=0xb7eca1c4, op=0xb7eca1c8, comm=0x43d2b4,
    ierr=0xbf7fdfcc) at pallreduce_f.c:87
#10 0xb7eabcc3 in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:79
#11 0xb7ea714d in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:776
#12 0xb7ea849d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391
#13 0xb7ea913e in pdsaupd (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, tol=1.1102230246251565e-16, resid=..., ncv=20, v=..., ldv=256, iparam=...,
    ipntr=..., workd=..., workl=..., lworkl=560, info=0, _bmat=1, _which=2) at pdsaupd.f:630
#14 0x0043b977 in parnoldi (comm=0) at issue46.f:200
#15 0x0043c58f in issue46 () at issue46.f:21
#16 0x0043b1ec in main (argc=1, argv=0xbf80401e) at issue46.f:23
#17 0xb6fd19a1 in __libc_start_main (main=0x43b1b0 <main>, argc=1, argv=0xbf803524, init=0x43c5d0 <__libc_csu_init>, fini=0x43c630 <__libc_csu_fini>,
    rtld_fini=0xb7ef5f60 <_dl_fini>, stack_end=0xbf80351c) at ../csu/libc-start.c:310
#18 0x0043b22c in _start ()

and

#0  0xb5a11b46 in mca_btl_vader_component_progress () at btl_vader_component.c:701
#1  0xb6ee2380 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#2  0xb4ae5b1d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:413
#3  mca_pml_ob1_recv (addr=<optimized out>, count=<optimized out>, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>,
    status=<optimized out>) at pml_ob1_irecv.c:135
#4  0xb77135a8 in PMPI_Recv (buf=0xbff0a9c0, count=10, type=0xb7794f00 <ompi_mpi_dblprec>, source=1, tag=0, comm=0xb779cfe0 <ompi_mpi_comm_world>,
    status=0xbff09d28) at precv.c:79
#5  0xb77ec0a5 in ompi_recv_f (
    buf=0xbff0a9c0 "\254g\027\033-ư?\362L,Lq\305\"?[\004\070\347\060\216ÿK\205tH\316lÿ'\265\276\231\037\360\223?\345\210\033]sf\251\277\240XW-\257'\303?-qz\263A;\245\277x\240|\022d*\254?\263\021\202\006\274\232\250\277", count=0xbff09e00, datatype=0x459018, source=0xbff09d18, tag=0xbff09d14, comm=0x4592b4,
    status=0xbff09d28, ierr=0xbff09d10) at precv_f.c:85
#6  0x004576a1 in av (comm=0, nloc=50, nx=10, mv_buf=..., v=..., w=...) at issue46.f:436
#7  0x0045792a in parnoldi (comm=0) at issue46.f:215
#8  0x0045858f in issue46 () at issue46.f:21
#9  0x004571ec in main (argc=1, argv=0xbff1001e) at issue46.f:23
#10 0xb70ba9a1 in __libc_start_main (main=0x4571b0 <main>, argc=1, argv=0xbff0ee24, init=0x4585d0 <__libc_csu_init>, fini=0x458630 <__libc_csu_fini>,
    rtld_fini=0xb7fdef60 <_dl_fini>, stack_end=0xbff0ee1c) at ../csu/libc-start.c:310
amckinstry commented 6 years ago

https://buildd.debian.org/status/package.php?p=elpa

Fails/hangs on mips64el and ppc64el, so it's not just 32-bit systems. Tested with the internal PMIx/libevent and it still occurs.

amckinstry commented 6 years ago

This reliably happens on i386 and other archs (but not x86_64). I've tested with Open MPI 3.1.3 as of Sept. 11.

0xb5933b16 in mca_btl_vader_component_progress () at btl_vader_component.c:690
#1  0xb6e3a380 in opal_progress () at runtime/opal_progress.c:228
#2  0xb4a24b1d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:413

and on core 2:

 0xb7176af1 in __GI___clock_gettime (clock_id=1, tp=0xbfeaa794) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f02d57 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4  0xb6eb13f8 in opal_progress_events () at runtime/opal_progress.c:181
#5  opal_progress () at runtime/opal_progress.c:244
#6  0xb76ab5a5 in ompi_request_wait_completion (req=0xdced00) at ../ompi/request/request.h:413

The test case is issue46 from arpack-ng, which exercises custom communicators: https://github.com/opencollab/arpack-ng/blob/master/PARPACK/TESTS/MPI/issue46.f

jsquyres commented 6 years ago

This may be related to #5375.

hjelmn commented 6 years ago

@amckinstry Please try this patch:

diff --git a/opal/mca/btl/vader/btl_vader_fbox.h b/opal/mca/btl/vader/btl_vader_fbox.h
index 17239ce8ef..b5526050e0 100644
--- a/opal/mca/btl/vader/btl_vader_fbox.h
+++ b/opal/mca/btl/vader/btl_vader_fbox.h
@@ -50,9 +50,10 @@ void mca_btl_vader_poll_handle_frag (mca_btl_vader_hdr_t *hdr, mca_btl_base_endp
 static inline void mca_btl_vader_fbox_set_header (mca_btl_vader_fbox_hdr_t *hdr, uint16_t tag,
                                                   uint16_t seq, uint32_t size)
 {
-    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = tag, .seq = seq, .size = size}};
-    opal_atomic_wmb ();
+    mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = 0, .seq = seq, .size = size}};
     hdr->ival = tmp.ival;
+    opal_atomic_wmb ();
+    hdr->data.tag = tag;
 }

 /* attempt to reserve a contiguous segment from the remote ep */
amckinstry commented 6 years ago

Unfortunately that's not the (full?) answer. It's now hanging elsewhere:

0xb71afaf1 in __GI___clock_gettime (clock_id=1, tp=0xbf8e7034) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f3bdd7 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4  0xb6eea3f8 in opal_progress_events () at runtime/opal_progress.c:181
#5  opal_progress () at runtime/opal_progress.c:244
#6  0xb76e44a5 in ompi_request_wait_completion (req=0x2145d00) at ../ompi/request/request.h:413
#7  ompi_request_default_wait (req_ptr=0xbf8e7134, status=0xbf8e7138) at request/req_wait.c:42
#8  0xb773a0a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x212b670, scount=1, sdatatype=0xb779cf00 <ompi_mpi_dblprec>, dest=1, stag=-12, recvbuf=0xbf8e7368,
    rcount=1, rdatatype=0xb779cf00 <ompi_mpi_dblprec>, source=1, rtag=-12, comm=0xb77a4fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#9  0xb773a4a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf8e7360, rbuf=0xbf8e7368, count=1, dtype=0xb779cf00 <ompi_mpi_dblprec>,
    op=0xb77b3a20 <ompi_mpi_op_max>, comm=0xb77a4fe0 <ompi_mpi_comm_world>, module=0x21472f0) at base/coll_base_allreduce.c:217
#10 0xb491b860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf8e7360, rbuf=0xbf8e7368, count=1, dtype=0xb779cf00 <ompi_mpi_dblprec>,
    op=0xb77b3a20 <ompi_mpi_op_max>, comm=0xb77a4fe0 <ompi_mpi_comm_world>, module=0x21472f0) at coll_tuned_decision_fixed.c:66
#11 0xb76f9f9b in PMPI_Allreduce (sendbuf=0xbf8e7360, recvbuf=0xbf8e7368, count=1, datatype=0xb779cf00 <ompi_mpi_dblprec>, op=0xb77b3a20 <ompi_mpi_op_max>,
    comm=0xb77a4fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#12 0xb77eae78 in ompi_allreduce_f (sendbuf=0xbf8e7360 'ڼ\264\334\313\361\373?\246w\324\344\000', recvbuf=0xbf8e7368 '\246w\324\344\000', count=0xb7fad1c0,
    datatype=0xb7fad1c4, op=0xb7fad1c0, comm=0x4dd2b4, ierr=0xbf8e735c) at pallreduce_f.c:87
#13 0xb7f8ec6a in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:73

and

 0xb7184af1 in __GI___clock_gettime (clock_id=1, tp=0xbf9a8054) at ../sysdeps/unix/clock_gettime.c:115
#3  0xb6f10dd7 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4  0xb6ebf3f8 in opal_progress_events () at runtime/opal_progress.c:181
#5  opal_progress () at runtime/opal_progress.c:244
#6  0xb76b94a5 in ompi_request_wait_completion (req=0x2581d00) at ../ompi/request/request.h:413
#7  ompi_request_default_wait (req_ptr=0xbf9a8154, status=0xbf9a8158) at request/req_wait.c:42
#8  0xb770f0a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x2583ca0, scount=2, sdatatype=0xb7771f00 <ompi_mpi_dblprec>, dest=0, stag=-12, recvbuf=0xbf9acb00,
    rcount=2, rdatatype=0xb7771f00 <ompi_mpi_dblprec>, source=0, rtag=-12, comm=0xb7779fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#9  0xb770f4a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf9acb10, rbuf=0xbf9acb00, count=2, dtype=0xb7771f00 <ompi_mpi_dblprec>,
    op=0xb7787a20 <ompi_mpi_op_sum>, comm=0xb7779fe0 <ompi_mpi_comm_world>, module=0x2583530) at base/coll_base_allreduce.c:217
#10 0xb48f4860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf9acb10, rbuf=0xbf9acb00, count=2, dtype=0xb7771f00 <ompi_mpi_dblprec>,
    op=0xb7787a20 <ompi_mpi_op_sum>, comm=0xb7779fe0 <ompi_mpi_comm_world>, module=0x2583530) at coll_tuned_decision_fixed.c:66
#11 0xb76cef9b in PMPI_Allreduce (sendbuf=0xbf9acb10, recvbuf=0xbf9acb00, count=2, datatype=0xb7771f00 <ompi_mpi_dblprec>, op=0xb7787a20 <ompi_mpi_op_sum>,
    comm=0xb7779fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#12 0xb77bfe78 in ompi_allreduce_f (sendbuf=0xbf9acb10 '\344}}\371Z\201\345?\306\022g\312\302\\\374?\000', recvbuf=0xbf9acb00 '\000',
    count=0xb7f96790 <j.3875>, datatype=0xb7f81044, op=0xb7f81040, comm=0x4d62b4, ierr=0xb7f9677c <ierr.3868>) at pallreduce_f.c:87
#13 0xb7f5e2b0 in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
    workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:610
#14 0xb7f6049d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
    mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
    at pdsaup2.f:391
jsquyres commented 6 years ago

See PR #5696.

amckinstry commented 6 years ago

@hjelmn There is more to the commit in https://github.com/open-mpi/ompi/pull/5696/commits/850fbff441756b2f9cde1007ead3e37ce22599c2 than in the patch listed above; is that deliberate?

jsquyres commented 6 years ago

@amckinstry Yes. 850fbff441756b2f9cde1007ead3e37ce22599c2 should be everything that is needed (even though it is more than what was described on this issue).

jsquyres commented 6 years ago

@amckinstry Could you let us know if this fixes the issue on your platforms?

amckinstry commented 6 years ago

@jsquyres Just tested the full patch 850fbff. No joy.

jsquyres commented 6 years ago

Doh!

So just to be clear: you applied the full 850fbff patch, and you still get hangs on i386?

amckinstry commented 6 years ago

Yes.

jsquyres commented 6 years ago

@amckinstry Sorry -- one more clarification: does it only hang on i386, or does it still hang on other 32-bit platforms?

amckinstry commented 6 years ago

Multiple: i386, armhf, and armel (arm64 seems fine), plus powerpc. Also kfreebsd-i386 and kfreebsd-amd64 (Debian ports with the kFreeBSD kernel). Works on all mips variants and ppc64el.

(Details here: https://buildd.debian.org/status/package.php?p=arpack as the arpack test suite fails.) Hard to see a pattern: it's not 32-bit-only or big-endian-only; some arch-sensitive code somewhere?

hjelmn commented 6 years ago

Planning to build an i386 virtual machine to take a closer look. Really surprised this is still happening.

hjelmn commented 6 years ago

For powerpc are you using --disable-builtin-atomics?

amckinstry commented 6 years ago

No, I'm not using --disable-builtin-atomics on any arch at the moment. I am using --enable-builtin-atomics on s390x (which works). Currently I'm only testing on a 2-core i386 VM on my laptop for simplicity.

jsquyres commented 6 years ago

@hjelmn We talked about this on the WebEx yesterday. It may be that you fixed one area of vader, but we're running into another problem. All we know is that it's hanging for @amckinstry -- not necessarily that it's the exact same problem.

jsquyres commented 6 years ago

@amckinstry If you could test the patch from #5829 and see if that fixes the issue (in addition to the other 2 patches), we'd greatly appreciate it. Thanks!

amckinstry commented 6 years ago

That appears to fix it, thanks!

gpaulsen commented 6 years ago

If this is on v3.0.x and v3.1.x, it is probably also an issue for v4.0.x.

jsquyres commented 6 years ago

PRs created (and linked to this issue) for all release branches.

jsquyres commented 6 years ago

Merged into all release branches -- huzzah!