pmodels / mpich

Official MPICH Repository
http://www.mpich.org
Other
535 stars 280 forks source link

MPI_Bsend and MPI_Ibsend are not thread-safe #1879

Closed mpichbot closed 7 years ago

mpichbot commented 7 years ago

Originally by blocksom on 2013-06-06 10:16:59 -0500


ALLFUNC is yielded/unlocked in MPIU_THREAD_GRANULARITY_GLOBAL and doesn't protect the non-threadsafe static data in bsendutil.c. Also, there is no lock at all for MPIU_THREAD_GRANULARITY_PER_OBJECT!

A pamid-only fix will be pushed soon, however this commit needs to be updated for other mpid devices before it can be pushed to the master branch.

mpichbot commented 7 years ago

Originally by haizhu on 2013-06-06 12:16:03 -0500


After an internal review I've signed-off on this commit and pushed it to the branch ticket-1879 in the mpich-ibm git repository.

mpichbot commented 7 years ago

Originally by bobc on 2013-06-07 10:23:02 -0500


Attachment added: ibsend.c (2.8 KiB) testcase

mpichbot commented 7 years ago

Originally by bobc on 2013-06-07 10:23:17 -0500


Attachment added: ibsend.sh (1.6 KiB) script

mpichbot commented 7 years ago

Originally by blocksom on 2013-07-17 11:00:52 -0500


I rebased this to master, pushed as branch mpich-ibm/ticket-1879-alt, and deleted branch mpich-ibm/ticket-1879.

mpichbot commented 7 years ago

Originally by balaji on 2013-08-07 07:13:41 -0500


Contrary to what the comment on this patch says, this is not a PAMI-only change. It touches a bunch of files above pamid as well. Also, it defines a few macros in pamid, but doesn't define the equivalent ones in ch3. So it won't build on ch3. Moreover, the logic of trying to exit ALLFUNC and enter BSENDDATA doesn't quite look right to me (though that part is in the pamid code, and I might not know the usage of the threading macros in there).

mpichbot commented 7 years ago

Originally by blocksom on 2013-08-07 08:21:10 -0500


I think this patch is not intended to be directly applied to the mpich/master branch, but is provided as an example as one way to fix the fundamental thread-safety issue.

The history here is that bgq had a customer problem concerning bsend. We fixed the problem in our product branch - which is based on mpich2 v1.5 - and then ported the fix to the mpich 3.* code base for the purpose of discussing what the real solution should be.

This code is not in the mpich-ibm/build branch, so this bug remain in our latest product builds (but not in that previous bgq v1r2m1 release). I think that any other solution that passes the attached test case would be acceptable.

mpichbot commented 7 years ago

Originally by bobc on 2013-08-07 09:20:14 -0500


I'm not sure exactly what is in your branch comments but I had two patches on the BGQ branch and they were PAMID only, not PAMI only. And I didn't mean I only touched PAMID, I meant I was only fixing PAMID. I specifically said I wasn't implementing it (lock) on non-PAMID platforms.

I posted a testcase, and discussed the problem with Dave Goodell, on the mailing list back in April.

Author: Bob Cernohous bobc@us.ibm.com Date: Fri May 17 10:53:45 2013 -0500

CPS 97RGJN: PAMID-only fix for multi-threaded MPI_Ibsend.

Extend the earlier MPI_Bsend fix (Issue 9524).

Put the new 'big' lock around the MPI_Ibsend, MPI_Buffer_xx and MPI_Startxx
where they make calls to non-threadsafe bsendutil.c functions.

I'm still not implementing this lock for non-PAMID builds at this time.

Author: Bob Cernohous bobc@us.ibm.com Date: Mon Apr 22 09:41:34 2013 -0500

Issue 9524: PAMID-only fix for multi-threaded MPI_Bsend.

I'm putting a new 'big' lock around the API since ALLFUNC is yielded/unlocked
in MPIU_THREAD_GRANULARITY_GLOBAL and doesn't protect the non-threadsafe
static data in bsendutil.c.  Also, there was no lock at all for
MPIU_THREAD_GRANULARITY_PER_OBJECT!  This new lock will cover both
MPIU_THREAD_GRANULARITY builds.

I'm not implementing this lock for non-PAMID builds at this time.  That's left
to the open source community (later).
mpichbot commented 7 years ago

Originally by huiweilu on 2013-11-11 11:23:54 -0600


First, this bug is caused in the user code of ibsend.c where pthread is created without explicitly setting to be joinable, so the threads are not synchronized at the call of pthread_join. The detached threads screw up with MPI_Comm_free where both of them calls ALLFUNC lock in unexpected order and leads to the assertion failure in initthread.c:213.

Second, BSEND lock is unnecessary. Basically Bob's patch is to create a new lock, BSEND, to lock the function of MPIR_Bsend_isend for thread safety; however, MPIR_Bsend_isend is already called inside a ALLFUNC lock both in MPI_Bsend and MPI_Ibsend.

See attached ibsend-fixed.c for correct use of pthread_create and pthread_join and following log for detail debugging informations.

Test the ibsend.c on two difference platforms:

  1. On x86_64 Macbooks, this test does not fail for 5,000 runs.
  2. On x86_64 Ubuntu, the test fails for ibsend and isend, but not for bsend or send. The fail log is as follows, neither seems to be related to MPI_Bsend or MPI_Ibsend:

    Assertion failed in file src/mpi/init/initthread.c at line 213: err == 0 internal ABORT - process 3

    Assertion failed in file src/mpid/ch3/src/mpid_vc.c at line 161: MPIU_Object_get_ref((vc)) >= 0

The assertion line is:

MPID_Thread_mutex_destroy(&MPIR_ThreadInfo.global_mutex, &err);
MPIU_Assert(err == 0);

Using printf, the err number is "err:16".

Further investigation reveals the problem of pthread_mutex. https://bugzilla.redhat.com/show_bug.cgi?id=113588

If pthread_unlock is called before pthread_lock, the pthread_mutex_destroy will return 16 (EBUSY).

If line 213 is commented out, neither of ibsend/isend fails for 5,000 runs.

If remove the following line in ibsend.c, all tests pass for 5,000 runs. MPI_Comm_free(&communicator);

Check the user code of pthreads, and find pthread_create and pthread_join is not used correctly.

mpichbot commented 7 years ago

Originally by huiweilu on 2013-11-11 11:25:19 -0600


Attachment added: ibsend-fixed.c (4.5 KiB)

mpichbot commented 7 years ago

Originally by jczhang on 2013-11-11 21:48:07 -0600


What I found is different. See http://linux.die.net/man/3/pthread_create

"by default, a new thread is created in a joinable state, unless attr was set to create the thread in a detached state (using pthread_attr_setdetachstate(3))."

mpichbot commented 7 years ago

Originally by huiweilu on 2013-11-11 22:25:29 -0600


OK.

There is another fix I forgot to mention, which may then be the cause of bug: the old code creates the same thread multiple times:

for (k=0;k<NUMSENDS;k++)
    pthread_create(&sender_thread, NULL, &sender, NULL);

Anyway, you can use a bash script to run the new ibsend thousands of times to see if it is bug free now. I have tested 5,000 times without error.

Replying to jczhang:

What I found is different. See http://linux.die.net/man/3/pthread_create

"by default, a new thread is created in a joinable state, unless attr was set to create the thread in a detached state (using pthread_attr_setdetachstate(3))."

mpichbot commented 7 years ago

Originally by jczhang on 2013-11-11 22:40:01 -0600


Yes. the old code used a wrong sender_thread data structure.

mpichbot commented 7 years ago

Originally by Huiwei Lu huiweilu@mcs.anl.gov on 2013-11-13 10:12:27 -0600


In ad44b6b335b1b07b4ef9c4baf22341ad6d76710e: Fix #1879 - Adds a test case for ibsend

Adds a test case for ibsend/bsend/isend/send in threading execution.

mpichbot commented 7 years ago

Originally by Huiwei Lu huiweilu@mcs.anl.gov on 2013-11-13 10:12:27 -0600


In 85e4a5072e3b1edd3651357601e9e72c18f887b6: Fix #1879 - false alarm, bug is in user code

First, this ticket is caused in the user code of ibsend.c where multiple pthreads are created using the same pthread variable. This leads to unexpected behaviors where detached threads screw up with MPI_Comm_free when both of them calls ALLFUNC lock in unexpected order, thus leads to the assertion failure in initthread.c:213.

Second, BSEND lock in the proposed patch in trac is unnecessary. Basically the patch is to create a new lock, BSEND, to lock the function of MPIR_Bsend_isend for thread safety; however, MPIR_Bsend_isend is already called inside a ALLFUNC lock both in MPI_Bsend and MPI_Ibsend.

See attached test/mpi/threads/pt2pt/ibsend.c for correct use of pthread_create and pthread_join

mpichbot commented 7 years ago

Originally by Huiwei Lu huiweilu@mcs.anl.gov on 2013-11-13 10:12:27 -0600


In 40c6fda95428adbaf8938d33031d24b160af9539: Fix #1879 - fix test script

fix testlist

Signed-off-by: Junchao Zhang jczhang@mcs.anl.gov