Overlap test sometimes failing in Mellanox Jenkins with Yalla

jsquyres commented 8 years ago

Mellanox Jenkins is showing a periodic problem with the overlap test using Yalla. There has been some discussion on https://github.com/open-mpi/ompi-release/pull/1240 about what the actual root cause is, starting around here: https://github.com/open-mpi/ompi-release/pull/1240#issuecomment-227990044.

Please continue the discussion here: @jladd-mlnx @miked-mellanox @artpol84 @nysal @hjelmn

We need to decide if this is a v2.0.0 or v2.0.1 issue. What is the scope of the problem, and how quickly can it be fixed? I'm tentatively marking this as v2.0.0, just so that it comes up on our daily discussions.

I'm opening a separate issue to track the discussion that was occurring on https://github.com/open-mpi/ompi-release/pull/1240 (so that 1240 can go back to its regularly scheduled programming).

artpol84 commented 8 years ago

Here is the analysis and the solution for the problem:

The root cause of the problem is the NEW OMPI request handling code, in particular - locking.
The bug was introduced in open-mpi/ompi-release@f19731bcf1388e7e27c9d20cca7ab48fbc2e497e and unmasked in https://github.com/open-mpi/ompi-release/commit/b4ff061bd45705eb39a91976fad855315ba990cc (I was working with the release originally):

$ git bisect good
b4ff061bd45705eb39a91976fad855315ba990cc is the first bad commit
commit b4ff061bd45705eb39a91976fad855315ba990cc
Author: Nathan Hjelm <hjelmn@lanl.gov>
Date:   Wed May 25 15:42:53 2016 -0600

    pml/yalla: update for request changes

    This commit brings the pml/yalla component up to date with the request
    rework changes.

    Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

    (cherry picked from open-mpi/ompi@9d439664f074a690ff8c550f9ef489bd70520a0f)

    Signed-off-by: Nathan Hjelm <hjelmn@lanl.gov>

:040000 040000 c8047a13774043cf31793730b23f9fcc3cfeb1d6 38e5a2e275229f11f92612f7f1bc541cd9147017 M      ompi

I guess that Mellanox Jenkins was screaming about it since the end of the May.
PR with the fix is proposed here: https://github.com/open-mpi/ompi/pull/1815. I'm still verifying the code, but would like to here comments from @hjelmn and @bosilca.
Problem analysis: In https://github.com/open-mpi/ompi-release/commit/b4ff061bd45705eb39a91976fad855315ba990cc ompi_request_lock locks around critical yalla regions were removed which made possible the following race condition leading to a deadlock:
- In multithreaded cases request processing can appear in one thread (will call it T1) and MPI_Wait is called in another (T2).
- In the T1 we have the following backtrace (see below)
- In wait_sync_update we are not guaranteed that after decreasing count to 0 in line 79 we won't loose the CPU.
- In later case T2 can quickly execute and release sync object in line 348 before T1 would get to WAIT_SYNC_SIGNAL in line 87.
- before https://github.com/open-mpi/ompi-release/commit/b4ff061bd45705eb39a91976fad855315ba990cc the lock in yalla request free callback was holding T2 from releasing sync object before T1's signal.

#2  0x00007fffedec5de0 in wait_sync_update (sync=0x7fffffffd030, updates=1, status=0) at ../../../../opal/threads/wait_sync.h:91
#3  0x00007fffedec5ed3 in ompi_request_complete (request=0xb2c830, with_signal=true) at ../../../../ompi/request/request.h:423
#4  0x00007fffedec6ba0 in mca_pml_yalla_recv_completion_cb (context=0xb2c830) at pml_yalla_request.c:208
#5  0x00007fffedbb64e6 in mxm_proto_progress (context=0x782fc0) at mxm/core/mxm.c:89
#6  mxm_progress (context=0x782fc0) at mxm/core/mxm.c:347
#7  0x00007fffedec31a6 in mca_pml_yalla_progress () at pml_yalla.c:290
#8  0x00007ffff765b8c0 in opal_progress () at runtime/opal_progress.c:225
#9  0x00007fffedec3bc8 in mca_pml_yalla_recv (buf=0x0, count=0, datatype=0x601c40, src=1, tag=34532, comm=0x601640, status=0x0) at pml_yalla.c:383
#10 0x00007ffff7cfdd90 in PMPI_Recv (buf=0x0, count=0, type=0x601c40, source=1, tag=34532, comm=0x601640, status=0x0) at precv.c:77
#11 0x0000000000400c63 in threadfunc ()
#12 0x0000003d698079d1 in start_thread () from /lib64/libpthread.so.0
#13 0x0000003d690e8b6d in clone () from /lib64/libc.so.6

P.S. @hjelmn, Yalla is not guilty and the facts were pointing on the new request code.

artpol84 commented 8 years ago

+ @karasevb

jladd-mlnx commented 8 years ago

@artpol84 Outstanding analysis. @hppritcha - LANL, we'll send you a bill for our services ☺️. @karasevb @artpol84 I think we better ensure that this change hasn't negatively impacted our message rates. Please test this with osu_mbw_mr against 1.10.3. @gpaulsen - is IBM keeping track of this in Spectrum MPI?

jladd-mlnx commented 8 years ago

@artpol84 @karasevb - please check the UCX PML as well. Probably the bug was pushed into that component as well.

rhc54 commented 8 years ago

@artpol84 thanks - we knew Yalla wasn't guilty of having a new error, but clearly there is some interaction that involved the way Yalla used the request code vs how the other pml's etc do, and that really meant you folks needed to be involved. we'll definitely look forward to your patch.

artpol84 commented 8 years ago

@jladd-mlnx this fix is general and fixes this kind of race conditions for all PML's (thanks to centralized approach to locking). Will check osu_mbw_mr on Monday.

gpaulsen commented 8 years ago

@nysal - has been following the locking changes needed, and should be able to comment. We've hit something similar in our PAMI PML, and are just now reacting to it.

artpol84 commented 8 years ago

Currently we have 2 alternative PR's that are trying to solve the same race condition:

One proposed by Mellanox: https://github.com/open-mpi/ompi/pull/1815
And one from LANL: https://github.com/open-mpi/ompi/pull/1816

Let's discuss their proc and cons based on community experience with regard to atomic performance. There is no rush - this bug was pending there since May 24. We can spend some more time before merging.

IMO from the performance perspective they are pretty much the same and in this case I believe that Mellanox patch should have precedence since it was opened first. However I might be lacking of experience with performance details of the atomics and I hope to acquire them from this discussion :). @bosilca @jladd-mlnx please advise and include others who might have an expertise.

Also I believe that either PR we will choose Mellanox copyright must be placed in the opal/threads/wait_sync.h because as it is usual with race conditions it took 95% of work to pinpoint and describe the race and 5% to fix it, also Mellanox contributed to the LANL PR helping to avoid possible new race condition.

rhc54 commented 8 years ago

We generally take whichever has the best performance, followed by the minimal change, and if those are both the same, then we would use the one from the person who typically contributes to that code area. We don't usually get hung up over who proposed something first or get into arguments over who did what amount of work as that is highly subjective. Frankly, I've never seen a case where the two parties didn't just work it out between themselves.

Adding Mellanox copyright to the other PR is mostly a shrug - I've never heard of anyone really caring before as it doesn't signify anything. The lawyers just ask that we mark the files we "touch", and that is all the copyrights in the files are intended to do.

open-mpi / ompi

Overlap test sometimes failing in Mellanox Jenkins with Yalla #1813