open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

OpenMPI 2.0.0rc2 with openib/iWARP btl stalls running OSU/IMB benchmarks #1664

Closed larrystevenwise closed 8 years ago

larrystevenwise commented 8 years ago

I have installed this Open MPI release on my lab machines. I find that the osu_latency test runs OK; however, the osu_bw test stalls. Upon further investigation, I find that the calls to MPI_Isend on the first node all complete and the calls to MPI_Irecv on the second node all complete. However, when the nodes go into MPI_Waitall, both nodes stall.

It is important that we identify where the problem is before Open MPI 2.0.0 is released as final.

larrystevenwise commented 8 years ago

I built Open MPI 2.0.0rc2 with the following configure command:

./configure --prefix=/usr/local/ --enable-mpi-thread-multiple --enable-mpirun-prefix-by-default --enable-debug --enable-mpi-cxx

The --enable-mpi-thread-multiple option does not matter; I tried it both ways.

larrystevenwise commented 8 years ago

When stalled, we see one rank with a pending SEND WR. The HW has consumed the WR, and it appears the node has transmitted the SEND to the peer.

larrystevenwise commented 8 years ago

We started bisecting the ompi-release git repo. For commits before March, the OSU tests work fine. Will update with the culprit commit.

larrystevenwise commented 8 years ago

Only osu_bcast and osu_bw have this issue; the other OSU tests run fine. The tests hang with no traffic running. The IMB tests are also not working on Open MPI 2.0.0rc2.

larrystevenwise commented 8 years ago

@jsquyres, I think this is a blocker for 2.0.0. Chelsio is working this as a top priority. I'm also pushing Chelsio to regression-test new RCs as they are released. This issue should not have made it to rc2, IMO...

jsquyres commented 8 years ago

@larrystevenwise Yes, would have been great to know that this bug existed before now. No one else in the community has Chelsio equipment with which to test.

larrystevenwise commented 8 years ago

@jsquyres, sorry about that. I can ask Chelsio to give some cards to someone who would test OMPI/openib/cxgb4 regularly.

jsquyres commented 8 years ago

I think the community would prefer if you or Chelsio did regular testing. 😄

larrystevenwise commented 8 years ago

agreed... :)

I've recommended they test rc1 of each release and then regression test each rc after that.

jsquyres commented 8 years ago

FWIW, it would be best if they could test more than that -- even submitting nightly (or even weekly?) MTT runs to the community database would be a good heads-up to catch problems before RC releases.

Absoft, for example, does this.

larrystevenwise commented 8 years ago

I've passed this along. Thanks Jeff.

larrystevenwise commented 8 years ago

I installed OFED-3.18-2-rc1 and the stall goes away. So:

OMPI-2.0.0rc2, rhel 6.5 kernel, OFED-3.18-1, chelsio's latest drivers: BAD
OMPI-2.0.0rc2, rhel 6.5 kernel, OFED-3.18-2-rc1, chelsio's latest drivers: GOOD

It could be that OMPI configures differently against OFED-3.18-1 vs. OFED-3.18-2, causing different code to run in OMPI itself, or it could be library/driver differences between OFED-3.18-1 and OFED-3.18-2...
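
One quick way to check the first possibility would be to compare the configure-detected verbs feature macros between the two builds (a sketch, assuming the default build layout where configure writes opal/include/opal_config.h):

grep -E 'HAVE_.*IBV' opal/include/opal_config.h

Run that in each build tree and compare; any difference there would point at OMPI compiling different code paths rather than at the OFED libraries themselves.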

larrystevenwise commented 8 years ago

git bisect results:

Here is the culprit commit in the ompi-release repo. From this commit onward, the bw test fails with an rdma_cm error or stalls.

repo: https://github.com/open-mpi/ompi-release.git   
branch v2.X 
[root@maneybhanjang ompi-release]# git bisect bad
d7c21da1b835be1f7f035d33550587e3e1954b0b is the first bad commit
commit d7c21da1b835be1f7f035d33550587e3e1954b0b
Author: Jeff Squyres 
Date:   Wed Feb 3 18:47:51 2016 -0800
    ompi_mpi_params.c: set mpi_add_procs_cutoff default to 0
    Decrease the default value of the "mpi_add_procs_cutoff" MCA param
    from 1024 to 0.
    (cherry picked from commit open-mpi/ompi@902b477aac2063976578f031aa079185104e07e0)
:040000 040000 62e9175129fcd9e581a813451fcc948d70298456 18bec256d8f39734aacf0c1d1f952c8f289a6ee6 M      ompi

We are re-verifying this. I still don't know how this relates to OFED-3.18-1 or -2 causing different behavior...

jsquyres commented 8 years ago

If this is a bug in OFED-3.18-1, should we care?

larrystevenwise commented 8 years ago

We don't care if the problem doesn't exist in OFED-3.18-2, but I would still like to understand the root cause. The problem is also avoided by backing out the identified commit. Also: adding '--mca mpi_add_procs_cutoff 1024' to the mpirun command line works around the problem.
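
For reference, the workaround looks roughly like this on the command line (hostnames and the osu_bw path are just placeholders for our setup):

mpirun -np 2 --host peer1,peer2 --mca btl openib,sm,self --mca mpi_add_procs_cutoff 1024 ./osu_bw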

Jeff, can you please explain what mpi_add_procs_cutoff does and how it might affect openib btl? Any ideas?

jsquyres commented 8 years ago

@larrystevenwise The behavior of mpi_add_procs_cutoff was discussed by the community literally for months. I'm actually pretty annoyed that Chelsio is coming in literally at the last second and calling for a full stop, despite the fact that they completely ignored the community for the entire development cycle. 😠

There is not a short explanation of mpi_add_procs_cutoff. We fundamentally changed the behavior of how btl_add_procs is invoked: it's only invoked for a given peer the first time we communicate with that peer. I'm sorry; I don't have time to explain further (I have other deadlines this week -- your poor prior planning does not constitute an emergency on my part). Go read the code and git logs.
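
The rough shape of the change, as a standalone sketch only (not the actual OMPI code, and the exact comparison against the cutoff may differ):

/* Standalone model of the add_procs cutoff idea -- NOT the actual OMPI
 * implementation.  When the job size is at or below the cutoff, every
 * peer gets the add_procs-style setup eagerly at init time (the old
 * behavior with the 1024 default); otherwise a peer is only set up the
 * first time we communicate with it (always the case with the new
 * default of 0). */
#include <stdbool.h>
#include <stdio.h>

#define NPROCS 2                /* np=2, as in the osu_bw runs here */

static bool peer_added[NPROCS];

static void add_proc(int peer)
{
    if (!peer_added[peer]) {
        peer_added[peer] = true;
        printf("add_procs-style setup for peer %d\n", peer);
    }
}

static void send_to(int peer)
{
    add_proc(peer);             /* lazy path: set up on first use */
    printf("send to peer %d\n", peer);
}

int main(void)
{
    int cutoff = 0;             /* the new 2.0.0 default */

    if (NPROCS <= cutoff) {     /* eager path: old 1024 default */
        for (int p = 0; p < NPROCS; ++p) {
            add_proc(p);
        }
    }

    send_to(1);                 /* first contact triggers the setup */
    send_to(1);                 /* already set up, nothing extra */
    return 0;
}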

larrystevenwise commented 8 years ago

Since there is an easy workaround, I'm not concerned about holding the release (which you wouldn't do anyway, apparently). I've raised the issue of spending more resources on OMPI with Chelsio, and I'll work to get them more involved and doing regular regression testing.

larrystevenwise commented 8 years ago

What we see when looking at the 2 ranks that are stalled is that one rank has posted a SEND WR which has been transmitted and processed by the peer rank, but the CQE for that SEND has never been polled. And I think there's another RECV CQE sitting in the CQ that the rank hasn't polled either. So it seems the rank has stopped polling the CQ for some reason. Debugging further.

larrystevenwise commented 8 years ago

Both the RCQ and SCQ for the stuck rank have never been successfully polled. The RCQ has a handful of completions and the SCQ has a send completion. Seems like the rank never got going on polling the CQs...

larrystevenwise commented 8 years ago

the stuck rank seems to be stuck doing this and never proceeding:

#0  0x00007fff56f40952 in vgetns (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:42
#1  do_monotonic (ts=0x2af6dab28d70) at arch/x86/vdso/vclock_gettime.c:81
#2  0x00007fff56f40a47 in __vdso_clock_gettime (clock=<optimized out>, ts=<optimized out>) at arch/x86/vdso/vclock_gettime.c:124
#3  0x0000003ef6803e46 in clock_gettime () from /lib64/librt.so.1
#4  0x00002af6d99f4647 in gettime (base=0x1f2b1c0, tp=0x1f2b350) at event.c:372
#5  0x00002af6d99f8961 in update_time_cache (base=0x1f2b1c0, flags=1) at event.c:430
#6  opal_libevent2022_event_base_loop (base=0x1f2b1c0, flags=1) at event.c:1639
#7  0x00002af6da6dffd9 in progress_engine (obj=0x1f2b1c0) at src/util/progress_threads.c:49
#8  0x0000003ef5c079d1 in start_thread () from /lib64/libpthread.so.0
#9  0x0000003ef58e89dd in clone () from /lib64/libc.so.6

jsquyres commented 8 years ago

Is it not progressing past that point, or is it endlessly looping in the event base loop in the progress thread?

I note that that is not the main MPI thread.

larrystevenwise commented 8 years ago

Some more data points: with the same OFED-3.18-1 code, mlx4 runs OK regardless of the mpi_add_procs_cutoff value; cxgb4 stalls if the value is <= 2.

larrystevenwise commented 8 years ago

@bharatpotnuri is working on this. Not sure if he has access to comment on these issues?

bharatpotnuri commented 8 years ago

The main thread of rank 1 is looping endlessly in opal_progress(), as it doesn't find any registered callbacks to progress further.

Below is the snippet of code where it fails to find a registered callback: opal/runtime/opal_progress.c:187

187     /* progress all registered callbacks */
188     for (i = 0 ; i < callbacks_len ; ++i) {
189         events += (callbacks[i])();
190     } 

Below is the back-trace of main thread:

0x00002b08047dd081 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:184
184             return (tp.tv_sec * 1e6 + tp.tv_nsec/1000);
(gdb) bt
#0  0x00002b08047dd081 in opal_timer_base_get_usec_clock_gettime () at timer_linux_component.c:184
#1  0x00002b0804729abc in opal_progress () at runtime/opal_progress.c:161
#2  0x00002b080409f590 in opal_condition_wait (c=0x2b08043e3380, m=0x2b08043e3300) at ../opal/threads/condition.h:72
#3  0x00002b080409fbd7 in ompi_request_default_wait_all (count=64, requests=0x602b80, statuses=0x608940) at request/req_wait.c:272
#4  0x00002b08041067f4 in PMPI_Waitall (count=64, requests=0x602b80, statuses=0x608940) at pwaitall.c:76
#5  0x00000000004010b4 in main (argc=1, argv=0x7fff35570348) at osu_bw.c:121

EDIT: Added verbatim blocks

jsquyres commented 8 years ago

@bharatpotnuri @larrystevenwise GitHub pro tip: use three back ticks on a line to start and end verbatim blocks. See https://help.github.com/articles/basic-writing-and-formatting-syntax/ for more GitHub markdown formatting.

jsquyres commented 8 years ago

So just to be clear: you're saying that opal_progress() is not ending up calling btl_openib_progress() in its loop over the callbacks[] array?

Hmm. I'm not sure how this could happen -- that the openib BTL was evidently selected for use, but then didn't have its progress function registered with the main progression engine (that should be done up in ob1, actually).
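
To make the failure mode concrete, here is a standalone model of that registration/progress loop (not the OMPI code): if a transport's progress function is never registered, the loop has nothing to call for it, its CQs are never polled, and anything waiting on a completion spins forever, which matches the MPI_Waitall stall above.

/* Standalone model of the opal_progress() callback loop -- NOT the
 * OMPI code.  If a transport forgets to register its progress
 * function, progress() keeps returning 0 and that transport's
 * completion queues are never polled. */
#include <stdio.h>

typedef int (*progress_cb_t)(void);

#define MAX_CALLBACKS 8
static progress_cb_t callbacks[MAX_CALLBACKS];
static int callbacks_len;

static void progress_register(progress_cb_t cb)
{
    if (callbacks_len < MAX_CALLBACKS) {
        callbacks[callbacks_len++] = cb;
    }
}

static int fake_btl_progress(void)
{
    /* A real BTL would poll its send/recv CQs here and return the
     * number of completions it handled. */
    printf("polling CQs\n");
    return 1;
}

static int progress(void)
{
    int events = 0;
    /* Mirrors the callbacks[] loop shown from opal_progress.c above. */
    for (int i = 0; i < callbacks_len; ++i) {
        events += callbacks[i]();
    }
    return events;
}

int main(void)
{
    /* Comment out the next line to model the bug: progress() then
     * returns 0 forever and a waiter would spin indefinitely. */
    progress_register(fake_btl_progress);

    for (int iter = 0; iter < 3; ++iter) {
        printf("progress returned %d\n", progress());
    }
    return 0;
}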

bharatpotnuri commented 8 years ago

Yes, opal_progress() is not calling btl_openib_component_progress().

The difference in code flow between the working case (mpi_add_procs_cutoff > 3, np=2) and the stall case (mpi_add_procs_cutoff = 0, np=2) is the nprocs value: it is 1 in the stall case and 2 in the working case.

Following the code back, nprocs is propagated through mca_pml_ob1_add_procs(). I am checking further how it gets called, and looking into the answers to your two questions. I will rerun the test with a few more prints and let you know soon. Thanks.

bharatpotnuri commented 8 years ago

This is the test command which stalls:

/usr/mpi/gcc/openmpi-2.0.0/bin/mpirun -np 2 --host peer1,peer2 --allow-run-as-root --output-filename /tmp/mpi --mca btl openib,sm,self /usr/mpi/gcc/openmpi-2.0.0/tests/OSU/osu_bw

bharatpotnuri commented 8 years ago

@jsquyres Could you please tell me what to check for in the MPI code for your two questions above? I couldn't figure it out from the ompi code.

jsquyres commented 8 years ago

I chatted with @hjelmn about this in IM. He's thinking that the issue might be in bml/r2 somewhere (r2 is the code that ob1 uses to multiplex over all of its BTLs), because r2 is where the BTL component progress functions are supposed to be registered.

He's poking into this today to see what's going on...

larrystevenwise commented 8 years ago

Perhaps this issue has something to do with the fact that cxgb4/iwarp is using the rdmacm CPC, and thus the CTS message? I see with openib/mlx4, which does not stall, the rdmacm CPC is not being used.

[stevo2.asicdesigners.com:27045] rdmacm CPC only supported when the first QP is a PP QP; skipped
[stevo2.asicdesigners.com:27045] openib BTL: rdmacm CPC unavailable for use on mlx4_0:1; skipped

hjelmn commented 8 years ago

Nah. This is caused by your network not supporting loopback. In that case openib is disqualified on the initial add_procs (local procs) only but then selected on a future add_proc call. mlnx isn't affected as the openib btl is not disqualified on local procs.

larrystevenwise commented 8 years ago

cxgb4 does support hw loopback, but maybe the openib btl doesn't know that?

larrystevenwise commented 8 years ago

Ah. from mca_btl_openib_add_procs():

#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
        /* Most current iWARP adapters (June 2008) cannot handle
           talking to other processes on the same host (!) -- so mark
           them as unreachable (need to use sm).  So for the moment,
           we'll just mark any local peer on an iWARP NIC as
           unreachable.  See trac ticket #1352. */
        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
            continue;
        }
#endif

larrystevenwise commented 8 years ago

cxgb3 did not support hw loopback. cxgb4 and beyond do.

jsquyres commented 8 years ago

I think @hjelmn will reply with detail shortly about what he just found.

@larrystevenwise Feel free to submit a PR to amend that #if block you just found. It needs to be a generic solution, though -- we have avoided "if device X, do Y" kinds of hard-coding in the openib BTL whenever possible.

larrystevenwise commented 8 years ago

@jsquyres I'll look into a way to detect hw loopback support. But the OpenFabrics Verbs API doesn't advertise this capability from what I can tell.

But the 'self' btl was included in the mpirun, so maybe the openib shouldn't be disqualified?

larrystevenwise commented 8 years ago

One gross way to detect hw loopback in openib is to attempt a hw loopback connection as part of module init...

hjelmn commented 8 years ago

We already do that in the openib add_procs. You can remove the #if block and see if it is working with your hardware.

I am working on a fix so it will not hang when local communication is not available. Should be ready shortly.

larrystevenwise commented 8 years ago

Removing the 2 blocks in btl_openib.c that excluded iwarp devices works around the issue.

diff --git a/opal/mca/btl/openib/btl_openib.c b/opal/mca/btl/openib/btl_openib.c
index 53d8c81..2e4211f 100644
--- a/opal/mca/btl/openib/btl_openib.c
+++ b/opal/mca/btl/openib/btl_openib.c
@@ -1065,18 +1065,6 @@ int mca_btl_openib_add_procs(
         struct opal_proc_t* proc = procs[i];
         mca_btl_openib_proc_t* ib_proc;

-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm).  So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable.  See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,
@@ -1138,18 +1126,6 @@ int mca_btl_openib_add_procs(

         opal_output(-1, "add procs: adding proc %d", i);

-#if defined(HAVE_STRUCT_IBV_DEVICE_TRANSPORT_TYPE)
-        /* Most current iWARP adapters (June 2008) cannot handle
-           talking to other processes on the same host (!) -- so mark
-           them as unreachable (need to use sm).  So for the moment,
-           we'll just mark any local peer on an iWARP NIC as
-           unreachable.  See trac ticket #1352. */
-        if (IBV_TRANSPORT_IWARP == openib_btl->device->ib_dev->transport_type &&
-            OPAL_PROC_ON_LOCAL_NODE(proc->proc_flags)) {
-            continue;
-        }
-#endif
-
         if(NULL == (ib_proc = mca_btl_openib_proc_get_locked(proc)) ) {
             /* if we don't have connection info for this process, it's
              * okay because some other method might be able to reach it,

EDIT: Added diff syntax highlighting

jsquyres commented 8 years ago

@larrystevenwise Another GitHub pro tip: you can tell GitHub how it should syntax highlight your verbatim blocks. For example, trail the 3 back ticks with "diff" (without the quotes) and it tells GitHub to render that verbatim block with diff syntax highlighting.

larrystevenwise commented 8 years ago

nifty!

bharatpotnuri commented 8 years ago

Removing the 2 blocks in btl_openib.c that excluded iWARP devices works around the issue. The test seems to be working but fails to complete.

# OSU MPI Bi-Directional Bandwidth Test v5.0
# Size      Bandwidth (MB/s)
1                       0.53
2                       1.07
4                       2.14
8                       4.24
16                      8.53
32                     16.97
64                     32.76
128                    63.48
256                   121.18
512                   242.11
1024                  441.96
2048                  752.99
4096                 1051.12
8192                 1385.62
16384                3153.99
32768                4147.69
65536                4857.31
131072               5495.65
262144               5690.68
524288               5768.05
1048576              5777.00
2097152              6132.86
4194304              6336.14

The main thread is waiting at rdmacm_endpoint_finalize(). Below is the back-trace of rank 1:

0x0000003ef5c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
(gdb) bt
#0  0x0000003ef5c0b5bc in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
#1  0x00002b38408a66e1 in rdmacm_endpoint_finalize (endpoint=0x81b710) at connect/btl_openib_connect_rdmacm.c:1231
#2  0x00002b3840891839 in mca_btl_openib_endpoint_destruct (endpoint=0x81b710) at btl_openib_endpoint.c:338
#3  0x00002b384087a998 in opal_obj_run_destructors (object=0x81b710) at ../../../../opal/class/opal_object.h:455
#4  0x00002b3840880a66 in mca_btl_openib_finalize_resources (btl=0x7d63c0) at btl_openib.c:1627
#5  0x00002b38408812e1 in mca_btl_openib_finalize (btl=0x7d63c0) at btl_openib.c:1703
#6  0x00002b3832a9f812 in mca_btl_base_close () at base/btl_base_frame.c:153
#7  0x00002b3832a89299 in mca_base_framework_close (framework=0x2b3832d3e5a0) at mca_base_framework.c:214
#8  0x00002b383243d5f6 in mca_bml_base_close () at base/bml_base_frame.c:130
#9  0x00002b3832a89299 in mca_base_framework_close (framework=0x2b38326e91a0) at mca_base_framework.c:214
#10 0x00002b38323bec35 in ompi_mpi_finalize () at runtime/ompi_mpi_finalize.c:418
#11 0x00002b38323eba6d in PMPI_Finalize () at pfinalize.c:45
#12 0x0000000000401184 in main (argc=1, argv=0x7fff781a1ee8) at osu_bw.c:136

ompi-release/opal/mca/btl/openib/connect/btl_openib_connect_rdmacm.c:1231

1228     /* Now wait for all the disconnect callbacks to occur */
1229     pthread_mutex_lock(&rdmacm_disconnect_lock);
1230     while (opal_list_get_size (&contents->ids)) {
1231         pthread_cond_wait (&rdmacm_disconnect_cond, &rdmacm_disconnect_lock);
1232     }

I am running the test with the git ompi-release repo at commit a7e27ef6 ("Merge pull request #1162 from abouteiller/bug/v2.x/mt_lazy_addproc"). Looking further.

bharatpotnuri commented 8 years ago

Please ignore my previous comment. I have tested with commit a679cc072b9458957671886062c643f6a40b0351 ("bml/r2: always add btl progress function"); it works fine with iWARP and fixes issue #1664.

jsquyres commented 8 years ago

Great! @hjelmn please PR to v2.x if you have not done so already.

larrystevenwise commented 8 years ago

So should I close this issue out now?