open-mpi / ompi

Open MPI main development repository
https://www.open-mpi.org

IBM test c_accumulate fails when using TCP BTL #393

Closed. rolfv closed this issue 8 years ago.

rolfv commented 9 years ago

The ibm/onesided/c_accumulate test fails when run over the TCP BTL. I do not observe the failure over the openib or sm BTLs, nor on the 1.8 branch. This happens only on master.

[rvandevaart@drossetti-ivy5 onesided]$ mpirun --mca btl self,tcp --host drossetti-ivy4,drossetti-ivy5 -np 2 c_accumulate
[drossetti-ivy5][[58785,1],1][../../../../../opal/mca/btl/tcp/btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[**ERROR**]: MPI_COMM_WORLD rank 0, file c_accumulate.c:43:
Rank 0 got 100 when it expected 201
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[rvandevaart@drossetti-ivy5 onesided]$ 
rolfv commented 9 years ago

This is a CUDA-aware-only issue. Taking this issue.

rolfv commented 9 years ago

After further investigation, this has nothing to do with CUDA-awareness. It turns out the problem is intermittent, which is what tripped me up. Such a simple test. I am giving this back to Nathan.

rolfv commented 9 years ago

Just retested with a master checkout from March 17, 2015. I can still see the error, but it is intermittent. Here is an example:

[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[**ERROR**]: MPI_COMM_WORLD rank 0, file c_accumulate.c:43:
Rank 0 got 100 when it expected 201
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[drossetti-ivy5][[62811,1],1][../../../../../opal/mca/btl/tcp/btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[**ERROR**]: MPI_COMM_WORLD rank 0, file c_accumulate.c:43:
Rank 0 got 100 when it expected 201
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD 
with errorcode 1.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[drossetti-ivy5][[62796,1],1][../../../../../opal/mca/btl/tcp/btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[rvandevaart@drossetti-ivy4 onesided]$ 
hjelmn commented 9 years ago

I wonder if the latest osc/pt2pt updates fixed this issue.

rolfv commented 9 years ago

I just retested and this problem still exists.

hjelmn commented 9 years ago

Thanks for verifying that. I will see if I can reproduce the issue.

ggouaillardet commented 9 years ago

As discussed in http://www.open-mpi.org/community/lists/devel/2015/04/17288.php, here is a patch for the test case:

diff --git a/ibm/onesided/c_accumulate.c b/ibm/onesided/c_accumulate.c
index 3ca39e4..b1596ae 100644
--- a/ibm/onesided/c_accumulate.c
+++ b/ibm/onesided/c_accumulate.c
@@ -30,6 +30,8 @@ main(int argc, char *argv[])
   SendBuff = rank + 100;
   RecvBuff = 0;

+  MPI_Barrier(MPI_COMM_WORLD);
+
   /* Accumulate to everyone, just for the heck of it */

   MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);

I am waiting for confirmation that MPI_Win_fence does not imply an MPI_Barrier before I push this to the ompi-tests/ibm repo.
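
For reference, here is a minimal self-contained sketch of the pattern the test exercises, with the barrier applied as in the patch above. The buffer names follow the diff; the window setup and the MPI_SUM accumulate loop are assumed for illustration and are not the actual ompi-tests source. With two ranks, each RecvBuff should end up as 100 + 101 = 201, matching the expected value in the failure output.

#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int i, rank, size, SendBuff, RecvBuff;
    MPI_Win Win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Expose RecvBuff through the window. */
    MPI_Win_create(&RecvBuff, sizeof(RecvBuff), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &Win);

    SendBuff = rank + 100;
    RecvBuff = 0;

    /* The fix from the patch: every rank must finish initializing
       RecvBuff before any other rank starts accumulating into it. */
    MPI_Barrier(MPI_COMM_WORLD);

    /* Accumulate to everyone, just for the heck of it. */
    MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);
    for (i = 0; i < size; ++i) {
        MPI_Accumulate(&SendBuff, 1, MPI_INT, i, 0, 1, MPI_INT, MPI_SUM, Win);
    }
    MPI_Win_fence(MPI_MODE_NOSUCCEED, Win);

    /* Each rank should now hold the sum of all SendBuff values. */
    if (RecvBuff != 100 * size + size * (size - 1) / 2) {
        fprintf(stderr, "Rank %d got %d\n", rank, RecvBuff);
    }

    MPI_Win_free(&Win);
    MPI_Finalize();
    return 0;
}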

hjelmn commented 9 years ago

I believe it was discussed on the mailing list a while ago that MPI_Win_fence does not imply a barrier. In this case it is just a no-op in osc/pt2pt.

hjelmn commented 9 years ago

If I am wrong I will be happy to update master and 1.8.

hjelmn commented 9 years ago

Yup. MPI-3.0 page 443 lines 44-47:

A fence call usually entails a barrier synchronization: a process completes a call to
MPI_WIN_FENCE only after all other processes in the group entered their matching call.
However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a
call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.
hjelmn commented 9 years ago

@ggouaillardet Go ahead and push the change to the test.

hjelmn commented 9 years ago

Can we consider this fixed?

ggouaillardet commented 9 years ago

I think so. Did you read Kawashima-san's comments on the devel ML? http://www.open-mpi.org/community/lists/devel/2015/04/17294.php

kawashima-fj commented 9 years ago

I think this is a bug in pt2pt OSC regarding MPI_MODE_NOPRECEDE, as described in my mail.

The easiest fix would be:

diff --git a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
index e925747..16d9ebd 100644
--- a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
+++ b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
@@ -124,6 +124,8 @@ ompi_osc_pt2pt_fence(int assert, ompi_win_t *win)
     if (0 != (assert & MPI_MODE_NOPRECEDE)) {
         OPAL_OUTPUT_VERBOSE((50, ompi_osc_base_framework.framework_output,
                              "osc pt2pt: fence end (short circuit)"));
+        ret = module->comm->c_coll.coll_barrier (module->comm,
+                                                 module->comm->c_coll.coll_barrier_module);
         return ret;
     }
hjelmn commented 9 years ago

I am OK with that change, but I am still unsure about the interpretation. The problem with the test case is that it initializes a region in the window and then claims there is no preceding operation, but that is not quite the case: there is no preceding MPI RMA operation, yet the memory has been modified. I will see if I can clear this up with the RMA WG.

bosilca commented 9 years ago

From the standard: "MPI_MODE_NOPRECEDE — the fence does not complete any sequence of locally issued RMA calls." It doesn't talk about any local accesses to the window, only about the locally issued RMA.

This is also clarified in the "Using Advanced MPI: Modern Features of the Message-Passing Interface" page 99 ( https://books.google.com/books?id=GY5IBQAAQBAJ&pg=PA99&lpg=PA99&dq=MPI_MODE_NOPRECEDE&source=bl&ots=lH0Wj9n0MO&sig=MhD1u9HjjGi8YKrzwTcMzd4JPUM&hl=en&sa=X&ei=ruY1Va2wBqa1sQSyuIH4Bw&ved=0CEkQ6AEwBg#v=onepage&q=MPI_MODE_NOPRECEDE&f=false )

Last but not least, the MPICH implementation has a similar check ( http://trac.mpich.org/projects/mpich/browser/src/mpid/pamid/src/onesided/mpid_win_fence.c?rev=658ff0d92533793bf453fc079e2b96cf576ac1f4 ).

George.

hppritcha commented 9 years ago

I agree with George: the code as written is correct. Inserting a barrier just masks an underlying problem in the osc/rma/tcp BTL stack.

@hjelmn any update?

jsquyres commented 9 years ago

From verbal discussion: this is an issue for v1.10.0 and v2.0.0.

hppritcha commented 8 years ago

@hjelmn do you think this is fixed? Or if it is not fixed, can you fix it for v2.0.0?

hppritcha commented 8 years ago

@hjelmn if you don't think this is a blocker for v2.0.0 please remove blocker label.

hjelmn commented 8 years ago

Well, there really is nothing to fix if we agree that the no-barrier implementation of MPI_MODE_NOPRECEDE is correct. As George pointed out, MPICH has the same behavior as both the osc/pt2pt and osc/rdma components, so I think it is safe to keep this behavior for now.

If anyone disagrees I can put a barrier in that path and bring it over with the rest of the RMA updates after add_procs is merged.

kawashima-fj commented 8 years ago

I agree that a barrier is not necessarily required in MPI_WIN_FENCE for the MPI_MODE_NOPRECEDE case. But my understanding is that some sort of synchronization is still required before starting succeeding RMA operations.

MPI_WIN_FENCE has four roles regarding access/exposure epochs: completing an access epoch, completing an exposure epoch, starting an access epoch, and starting an exposure epoch. These are explained in MPI-3.1 page 441 lines 9-15.

MPI-3.1 has two descriptions below regarding MPI_MODE_NOPRECEDE.

page 441 lines 32-34:

MPI_MODE_NOPRECEDE - the fence does not complete any sequence of locally issued RMA calls. If this assertion is given by any process in the window group, then it must be given by all processes in the group.

page 451 lines 17-20:

A fence call usually entails a barrier synchronization: a process completes a call to MPI_WIN_FENCE only after all other processes in the group entered their matching call. However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.

MPI_MODE_NOPRECEDE only affects completing access/exposure epochs. If MPI_MODE_NOPRECEDE is specified for an MPI_WIN_FENCE call and no succeeding RMA communication calls will be issued until the next MPI_WIN_FENCE call, no synchronization (including a barrier) is required. But if succeeding RMA communication calls will be issued, the target memory must be accessed only after the target rank starts the next exposure epoch. This is another rule, described in MPI-3.1 page 437 lines 10-11.

In active target communication, a target window can be accessed by RMA operations only within an exposure epoch.

To conform to this rule, a barrier is not strictly required; synchronization between the origin and the target of the RMA communication call is sufficient. But inserting a barrier in MPI_WIN_FENCE is the easiest implementation. For pt2pt OSC, an implementation in which the origin sends a message via the PML without synchronization, and the target receives the message and loads from/stores to its window only after its corresponding MPI_WIN_FENCE call, would also be OK.

I'm unsure whether my interpretation above is correct. If someone says it is incorrect, I'll defer to them.

Regarding the MPICH implementation, MPI_WIN_FENCE in the MPICH source seems to have two implementations (ch3 and pamid).

The source George referred to is the pamid implementation. The ch3 implementation bypasses an ibarrier on line 520 only if both MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED are specified in assert. Though I don't know the MPICH source well, my understanding is that the ch3 implementation is the generic one and the pamid one is specific to IBM PAMI.
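
To make that condition concrete, here is a rough sketch of the kind of check described above; the function name and structure are assumed for illustration and are not the actual MPICH (or Open MPI) source:

#include <mpi.h>

/* Illustrative only: the barrier can be skipped only when the fence
   neither completes any locally issued RMA sequence (MPI_MODE_NOPRECEDE)
   nor starts one (MPI_MODE_NOSUCCEED). */
static int fence_can_skip_barrier(int assert)
{
    return (assert & MPI_MODE_NOPRECEDE) && (assert & MPI_MODE_NOSUCCEED);
}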

The following code shows my intended result (1) with MPICH master (ch3), but shows the unintended result (0) with Open MPI master (both pt2pt and rdma).

#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, obuf = 0, tbuf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&tbuf, sizeof(tbuf), 1, MPI_INFO_NULL, MPI_COMM_WORLD, &win);

    if (rank == 0) {
        /* A local (non-RMA) store to the window memory just before the
           fence; the sleep widens the race with rank 1's MPI_Get. */
        sleep(1);
        tbuf = 1;
    }
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
    if (rank == 1) {
        /* Rank 1 must observe the value rank 0 stored before entering
           the fence, i.e. obuf should become 1. */
        MPI_Get(&obuf, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);

    if (rank == 1) {
        printf("%d\n", obuf);
    }

    MPI_Win_free(&win);
    MPI_Finalize();

    return 0;
}
kawashima-fj commented 8 years ago

This issue is fixed by 5b9c82a9648b06364b695e199711e1c26a3afeeb in master and open-mpi/ompi-release@0e8f2675bf13dd6aa34cbf4492f92a5cddcaaf6f in v2.0 and can be closed. Right? Thanks!

jsquyres commented 8 years ago

@hjelmn @sjeaugey @bosilca Do you agree that this can be closed?

hjelmn commented 8 years ago

Yeah. I added a barrier until I can ensure that no operations complete at the target before it reaches the fence.