With a debug build:
#0 mca_btl_vader_check_fboxes () at btl_vader_fbox.h:170
#1 mca_btl_vader_component_progress () at btl_vader_component.c:702
#2 0xb6df9380 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#3 0xb75f34a5 in ompi_request_wait_completion (req=0x13f4d00) at ../ompi/request/request.h:413
#4 ompi_request_default_wait (req_ptr=0xbf7fdda4, status=0xbf7fdda8) at request/req_wait.c:42
#5 0xb76490a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x13da6c0, scount=1, sdatatype=0xb76abf00 <ompi_mpi_dblprec>, dest=0, stag=-12, recvbuf=0xbf7fdfe0,
rcount=1, rdatatype=0xb76abf00 <ompi_mpi_dblprec>, source=0, rtag=-12, comm=0xb76b3fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#6 0xb76494a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf7fdfd0, rbuf=0xbf7fdfe0, count=1, dtype=0xb76abf00 <ompi_mpi_dblprec>,
op=0xb76c1a20 <ompi_mpi_op_sum>, comm=0xb76b3fe0 <ompi_mpi_comm_world>, module=0x13f62f0) at base/coll_base_allreduce.c:217
#7 0xb49e2860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf7fdfd0, rbuf=0xbf7fdfe0, count=1, dtype=0xb76abf00 <ompi_mpi_dblprec>,
op=0xb76c1a20 <ompi_mpi_op_sum>, comm=0xb76b3fe0 <ompi_mpi_comm_world>, module=0x13f62f0) at coll_tuned_decision_fixed.c:66
#8 0xb7608f9b in PMPI_Allreduce (sendbuf=0xbf7fdfd0, recvbuf=0xbf7fdfe0, count=1, datatype=0xb76abf00 <ompi_mpi_dblprec>, op=0xb76c1a20 <ompi_mpi_op_sum>,
comm=0xb76b3fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#9 0xb76f9e78 in ompi_allreduce_f (sendbuf=0xbf7fdfd0 "\030!甯A\353?\001먠\335\036\364?FLl\366̑\362?\220\347", <incomplete sequence \355\267>,
recvbuf=0xbf7fdfe0 "FLl\366̑\362?\220\347", <incomplete sequence \355\267>, count=0xb7eca1c0, datatype=0xb7eca1c4, op=0xb7eca1c8, comm=0x43d2b4,
ierr=0xbf7fdfcc) at pallreduce_f.c:87
#10 0xb7eabcc3 in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:79
#11 0xb7ea714d in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:776
#12 0xb7ea849d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
at pdsaup2.f:391
#13 0xb7ea913e in pdsaupd (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, tol=1.1102230246251565e-16, resid=..., ncv=20, v=..., ldv=256, iparam=...,
ipntr=..., workd=..., workl=..., lworkl=560, info=0, _bmat=1, _which=2) at pdsaupd.f:630
#14 0x0043b977 in parnoldi (comm=0) at issue46.f:200
#15 0x0043c58f in issue46 () at issue46.f:21
#16 0x0043b1ec in main (argc=1, argv=0xbf80401e) at issue46.f:23
#17 0xb6fd19a1 in __libc_start_main (main=0x43b1b0 <main>, argc=1, argv=0xbf803524, init=0x43c5d0 <__libc_csu_init>, fini=0x43c630 <__libc_csu_fini>,
rtld_fini=0xb7ef5f60 <_dl_fini>, stack_end=0xbf80351c) at ../csu/libc-start.c:310
#18 0x0043b22c in _start ()
and
#0 0xb5a11b46 in mca_btl_vader_component_progress () at btl_vader_component.c:701
#1 0xb6ee2380 in opal_progress () from /usr/lib/i386-linux-gnu/libopen-pal.so.40
#2 0xb4ae5b1d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:413
#3 mca_pml_ob1_recv (addr=<optimized out>, count=<optimized out>, datatype=<optimized out>, src=<optimized out>, tag=<optimized out>, comm=<optimized out>,
status=<optimized out>) at pml_ob1_irecv.c:135
#4 0xb77135a8 in PMPI_Recv (buf=0xbff0a9c0, count=10, type=0xb7794f00 <ompi_mpi_dblprec>, source=1, tag=0, comm=0xb779cfe0 <ompi_mpi_comm_world>,
status=0xbff09d28) at precv.c:79
#5 0xb77ec0a5 in ompi_recv_f (
buf=0xbff0a9c0 "\254g\027\033-ư?\362L,Lq\305\"?[\004\070\347\060\216ÿK\205tH\316lÿ'\265\276\231\037\360\223?\345\210\033]sf\251\277\240XW-\257'\303?-qz\263A;\245\277x\240|\022d*\254?\263\021\202\006\274\232\250\277", count=0xbff09e00, datatype=0x459018, source=0xbff09d18, tag=0xbff09d14, comm=0x4592b4,
status=0xbff09d28, ierr=0xbff09d10) at precv_f.c:85
#6 0x004576a1 in av (comm=0, nloc=50, nx=10, mv_buf=..., v=..., w=...) at issue46.f:436
#7 0x0045792a in parnoldi (comm=0) at issue46.f:215
#8 0x0045858f in issue46 () at issue46.f:21
#9 0x004571ec in main (argc=1, argv=0xbff1001e) at issue46.f:23
#10 0xb70ba9a1 in __libc_start_main (main=0x4571b0 <main>, argc=1, argv=0xbff0ee24, init=0x4585d0 <__libc_csu_init>, fini=0x458630 <__libc_csu_fini>,
rtld_fini=0xb7fdef60 <_dl_fini>, stack_end=0xbff0ee1c) at ../csu/libc-start.c:310
https://buildd.debian.org/status/package.php?p=elpa
Fails/hangs on mips64el and ppc64el as well, so it is not just 32-bit systems. Tested with the internal PMIx/libevent and it still occurs.
This reliably happens on i386 and other archs (but not x86_64). I've tested with OpenMPI 3.1.3 as of Sept. 11.
0xb5933b16 in mca_btl_vader_component_progress () at btl_vader_component.c:690
#1 0xb6e3a380 in opal_progress () at runtime/opal_progress.c:228
#2 0xb4a24b1d in ompi_request_wait_completion (req=<optimized out>) at ../../../../ompi/request/request.h:413
and on core 2:
0xb7176af1 in __GI___clock_gettime (clock_id=1, tp=0xbfeaa794) at ../sysdeps/unix/clock_gettime.c:115
#3 0xb6f02d57 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4 0xb6eb13f8 in opal_progress_events () at runtime/opal_progress.c:181
#5 opal_progress () at runtime/opal_progress.c:244
#6 0xb76ab5a5 in ompi_request_wait_completion (req=0xdced00) at ../ompi/request/request.h:413
The test case is issue46 from arpack-ng, which exercises custom communicators: https://github.com/opencollab/arpack-ng/blob/master/PARPACK/TESTS/MPI/issue46.f
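For reference, the first trace bottoms out in pdnorm2, which just combines the local contributions to a 2-norm with a double-precision MPI_ALLREDUCE on whatever communicator arpack was handed. A rough standalone C analog of that pattern (a hypothetical illustration, not the actual Fortran test):

#include <math.h>
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);

    /* The arpack test passes its own communicator around; a dup of
       MPI_COMM_WORLD stands in for that here. */
    MPI_Comm comm;
    MPI_Comm_dup(MPI_COMM_WORLD, &comm);

    int rank;
    MPI_Comm_rank(comm, &rank);

    /* Local contribution to a parallel 2-norm, as pdnorm2 computes. */
    double local = 0.5 * (rank + 1);
    double local_sq = local * local;
    double global_sq = 0.0;

    /* This is the call the backtraces show stuck in
       ompi_request_wait_completion / mca_btl_vader_component_progress. */
    MPI_Allreduce(&local_sq, &global_sq, 1, MPI_DOUBLE, MPI_SUM, comm);

    if (rank == 0) {
        printf("norm = %f\n", sqrt(global_sq));
    }

    MPI_Comm_free(&comm);
    MPI_Finalize();
    return 0;
}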
This may be related to #5375.
@amckinstry Please try this patch:
diff --git a/opal/mca/btl/vader/btl_vader_fbox.h b/opal/mca/btl/vader/btl_vader_fbox.h
index 17239ce8ef..b5526050e0 100644
--- a/opal/mca/btl/vader/btl_vader_fbox.h
+++ b/opal/mca/btl/vader/btl_vader_fbox.h
@@ -50,9 +50,10 @@ void mca_btl_vader_poll_handle_frag (mca_btl_vader_hdr_t *hdr, mca_btl_base_endp
static inline void mca_btl_vader_fbox_set_header (mca_btl_vader_fbox_hdr_t *hdr, uint16_t tag,
uint16_t seq, uint32_t size)
{
- mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = tag, .seq = seq, .size = size}};
- opal_atomic_wmb ();
+ mca_btl_vader_fbox_hdr_t tmp = {.data = {.tag = 0, .seq = seq, .size = size}};
hdr->ival = tmp.ival;
+ opal_atomic_wmb ();
+ hdr->data.tag = tag;
}
/* attempt to reserve a contiguous segment from the remote ep */
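For anyone following along: as I read it, the old code built the whole fast-box header (tag included) in a temporary and stored it in a single hdr->ival assignment, with the write barrier placed before that store; on a 32-bit target that 64-bit store can be split, so a receiver polling on the tag could observe it before the seq/size half lands. The patch writes everything except the tag, issues the barrier, and publishes the tag last. A standalone sketch of that publish/poll ordering using C11 atomics rather than the OPAL wrappers (hypothetical names, not the actual vader code):

#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical header layout standing in for mca_btl_vader_fbox_hdr_t. */
struct fbox_hdr {
    uint32_t size;
    uint16_t seq;
    _Atomic uint16_t tag;   /* 0 means "not published yet" */
};

/* Producer: fill in the payload fields first, then publish the tag last.
   The release store does the job of opal_atomic_wmb() plus the late
   hdr->data.tag = tag assignment in the patch. */
static void fbox_publish(struct fbox_hdr *hdr, uint16_t tag, uint16_t seq, uint32_t size)
{
    hdr->size = size;
    hdr->seq  = seq;
    atomic_store_explicit(&hdr->tag, tag, memory_order_release);
}

/* Consumer: poll the tag; once a non-zero tag is observed, the acquire
   load guarantees the size/seq written before the release are visible. */
static uint16_t fbox_poll(struct fbox_hdr *hdr, uint32_t *size_out)
{
    uint16_t tag = atomic_load_explicit(&hdr->tag, memory_order_acquire);
    if (tag != 0) {
        *size_out = hdr->size;   /* safe to read now */
    }
    return tag;
}

int main(void)
{
    struct fbox_hdr hdr = { .size = 0, .seq = 0, .tag = 0 };
    uint32_t size = 0;

    fbox_publish(&hdr, 42, 1, 128);
    if (fbox_poll(&hdr, &size) != 0) {
        printf("published fragment of size %u\n", size);
    }
    return 0;
}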
Unfortunately that is not the (full?) answer. It's now hanging elsewhere:
0xb71afaf1 in __GI___clock_gettime (clock_id=1, tp=0xbf8e7034) at ../sysdeps/unix/clock_gettime.c:115
#3 0xb6f3bdd7 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4 0xb6eea3f8 in opal_progress_events () at runtime/opal_progress.c:181
#5 opal_progress () at runtime/opal_progress.c:244
#6 0xb76e44a5 in ompi_request_wait_completion (req=0x2145d00) at ../ompi/request/request.h:413
#7 ompi_request_default_wait (req_ptr=0xbf8e7134, status=0xbf8e7138) at request/req_wait.c:42
#8 0xb773a0a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x212b670, scount=1, sdatatype=0xb779cf00 <ompi_mpi_dblprec>, dest=1, stag=-12, recvbuf=0xbf8e7368,
rcount=1, rdatatype=0xb779cf00 <ompi_mpi_dblprec>, source=1, rtag=-12, comm=0xb77a4fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#9 0xb773a4a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf8e7360, rbuf=0xbf8e7368, count=1, dtype=0xb779cf00 <ompi_mpi_dblprec>,
op=0xb77b3a20 <ompi_mpi_op_max>, comm=0xb77a4fe0 <ompi_mpi_comm_world>, module=0x21472f0) at base/coll_base_allreduce.c:217
#10 0xb491b860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf8e7360, rbuf=0xbf8e7368, count=1, dtype=0xb779cf00 <ompi_mpi_dblprec>,
op=0xb77b3a20 <ompi_mpi_op_max>, comm=0xb77a4fe0 <ompi_mpi_comm_world>, module=0x21472f0) at coll_tuned_decision_fixed.c:66
#11 0xb76f9f9b in PMPI_Allreduce (sendbuf=0xbf8e7360, recvbuf=0xbf8e7368, count=1, datatype=0xb779cf00 <ompi_mpi_dblprec>, op=0xb77b3a20 <ompi_mpi_op_max>,
comm=0xb77a4fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#12 0xb77eae78 in ompi_allreduce_f (sendbuf=0xbf8e7360 'ڼ\264\334\313\361\373?\246w\324\344\000', recvbuf=0xbf8e7368 '\246w\324\344\000', count=0xb7fad1c0,
datatype=0xb7fad1c4, op=0xb7fad1c0, comm=0x4dd2b4, ierr=0xbf8e735c) at pallreduce_f.c:87
#13 0xb7f8ec6a in pdnorm2 (comm=0, n=50, x=..., inc=1) at pdnorm2.f:73
and
0xb7184af1 in __GI___clock_gettime (clock_id=1, tp=0xbf9a8054) at ../sysdeps/unix/clock_gettime.c:115
#3 0xb6f10dd7 in opal_timer_linux_get_cycles_clock_gettime () at timer_linux_component.c:223
#4 0xb6ebf3f8 in opal_progress_events () at runtime/opal_progress.c:181
#5 opal_progress () at runtime/opal_progress.c:244
#6 0xb76b94a5 in ompi_request_wait_completion (req=0x2581d00) at ../ompi/request/request.h:413
#7 ompi_request_default_wait (req_ptr=0xbf9a8154, status=0xbf9a8158) at request/req_wait.c:42
#8 0xb770f0a5 in ompi_coll_base_sendrecv_actual (sendbuf=0x2583ca0, scount=2, sdatatype=0xb7771f00 <ompi_mpi_dblprec>, dest=0, stag=-12, recvbuf=0xbf9acb00,
rcount=2, rdatatype=0xb7771f00 <ompi_mpi_dblprec>, source=0, rtag=-12, comm=0xb7779fe0 <ompi_mpi_comm_world>, status=0x0) at base/coll_base_util.c:59
#9 0xb770f4a9 in ompi_coll_base_allreduce_intra_recursivedoubling (sbuf=0xbf9acb10, rbuf=0xbf9acb00, count=2, dtype=0xb7771f00 <ompi_mpi_dblprec>,
op=0xb7787a20 <ompi_mpi_op_sum>, comm=0xb7779fe0 <ompi_mpi_comm_world>, module=0x2583530) at base/coll_base_allreduce.c:217
#10 0xb48f4860 in ompi_coll_tuned_allreduce_intra_dec_fixed (sbuf=0xbf9acb10, rbuf=0xbf9acb00, count=2, dtype=0xb7771f00 <ompi_mpi_dblprec>,
op=0xb7787a20 <ompi_mpi_op_sum>, comm=0xb7779fe0 <ompi_mpi_comm_world>, module=0x2583530) at coll_tuned_decision_fixed.c:66
#11 0xb76cef9b in PMPI_Allreduce (sendbuf=0xbf9acb10, recvbuf=0xbf9acb00, count=2, datatype=0xb7771f00 <ompi_mpi_dblprec>, op=0xb7787a20 <ompi_mpi_op_sum>,
comm=0xb7779fe0 <ompi_mpi_comm_world>) at pallreduce.c:110
#12 0xb77bfe78 in ompi_allreduce_f (sendbuf=0xbf9acb10 '\344}}\371Z\201\345?\306\022g\312\302\\\374?\000', recvbuf=0xbf9acb00 '\000',
count=0xb7f96790 <j.3875>, datatype=0xb7f81044, op=0xb7f81040, comm=0x4d62b4, ierr=0xb7f9677c <ierr.3868>) at pallreduce_f.c:87
#13 0xb7f5e2b0 in pdsaitr (comm=0, ido=1, bmat=..., n=50, k=0, np=4, mode=1, resid=..., rnorm=1.7112505600240149, v=..., ldv=256, h=..., ldh=20, ipntr=...,
workd=..., workl=..., info=0, _bmat=1) at pdsaitr.f:610
#14 0xb7f6049d in pdsaup2 (comm=0, ido=1, bmat=..., n=50, which=..., nev=4, np=16, tol=1.1102230246251565e-16, resid=..., mode=1, iupd=1, ishift=1,
mxiter=300, v=..., ldv=256, h=..., ldh=20, ritz=..., bounds=..., q=..., ldq=20, workl=..., ipntr=..., workd=..., info=0, _bmat=1, _which=2)
at pdsaup2.f:391
See PR #5696.
@hjelmn There is more to the commit in https://github.com/open-mpi/ompi/pull/5696/commits/850fbff441756b2f9cde1007ead3e37ce22599c2 than in the patch listed above; is that deliberate?
@amckinstry Yes. 850fbff441756b2f9cde1007ead3e37ce22599c2 should be everything that is needed (even though it is more than what was described on this issue).
@amckinstry Could you let us know if this fixes the issue on your platforms?
@jsquyres Just tested the full patch 850fbff. No joy.
Doh!
So, just to be clear: even with the full 850fbff patch applied, you still get hangs on i386?
Yes.
@amckinstry Sorry -- one more clarification: does it only hang on i386, or does it still hang on other 32 bit platforms?
Multiple: i386 (also armhf and armel; arm64 seems fine) and powerpc. Also kfreebsd-i386 and kfreebsd-amd64 (Debian ports with the kFreeBSD kernel). Works on all mips variants and ppc64el.
(Details here: https://buildd.debian.org/status/package.php?p=arpack, where the arpack test suite fails.) Hard to see a pattern: it is not 32-bit-only or big-endian-only; some arch-sensitive code somewhere?
Planning to build an i386 virtual machine to take a closer look. Really surprised this is still happening.
For powerpc are you using --disable-builtin-atomics?
No, not using --disable-builtin-atomics on any arch at the moment. Using --enable-builtin-atomics on s390x (which works). Currently I'm only testing on a 2-core VM on my laptop (i386) for simplicity.
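As I understand it, --enable/--disable-builtin-atomics only controls whether OPAL's atomics and memory barriers are implemented with the compiler's __atomic builtins or with its own hand-written assembly; a trivial standalone illustration of the builtin flavour (not OPAL code):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    int32_t counter = 0;
    int32_t flag = 0;

    /* Atomic read-modify-write via the GCC/Clang builtin. */
    __atomic_fetch_add(&counter, 1, __ATOMIC_RELAXED);

    /* Release store: writes above (counter) become visible to any
       thread that does an acquire load of 'flag' and sees 1. */
    __atomic_store_n(&flag, 1, __ATOMIC_RELEASE);

    int32_t seen = __atomic_load_n(&flag, __ATOMIC_ACQUIRE);
    printf("counter=%d flag=%d\n", counter, seen);
    return 0;
}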
@hjelmn We talked about this on the webex yesterday. It may be that you fixed one area of vader, but we're running into another problem. All we know is that it's hanging for @amckinstry -- not necessarily that it's the same exact problem.
@amckinstry If you could test the patch from #5829 and see if that fixes the issue (in addition to the other 2 patches), we'd greatly appreciate it. Thanks!
That appears to fix it, thanks!
If this is on v3.0.x and v3.1.x, it is probably also an issue for v4.0.x.
PRs created (and linked to this issue) for all release branches.
Merged into all release branches -- huzzah!
Background information
Open MPI 3.1.2, PMIx 3.0.1
Installed on Debian/sid.
Testing across our suite of MPI programs, we're seeing hangs in some apps; offhand, the common factor looks like 32-bit systems: i386, mipsel.
Debian bugs: https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=905418 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907267 https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=907407
This is on simple 2-core systems for the most part. I have a straightforward, reproducible case in a VM here on i386 with ARPACK. The backtraces look like the ones at the top of this report.