rolfv closed this issue 8 years ago.
This is a CUDA-aware only issue. Taking this issue.
After further investigation, this has nothing to do with CUDA-awareness. It turns out that this is an intermittent problem, which is what tripped me up. Such a simple test. I am going to give this back to Nathan.
Just retested with master checkout from March 17, 2015. I can still see the error but it is intermittent. Here is an example:
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[**ERROR**]: MPI_COMM_WORLD rank 0, file c_accumulate.c:43:
Rank 0 got 100 when it expected 201
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[drossetti-ivy5][[62811,1],1][../../../../../opal/mca/btl/tcp/btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[rvandevaart@drossetti-ivy4 onesided]$ mpirun --host drossetti-ivy4,drossetti-ivy5 --mca btl self,tcp -np 2 c_accumulate
[**ERROR**]: MPI_COMM_WORLD rank 0, file c_accumulate.c:43:
Rank 0 got 100 when it expected 201
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode 1.
NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
[drossetti-ivy5][[62796,1],1][../../../../../opal/mca/btl/tcp/btl_tcp_frag.c:228:mca_btl_tcp_frag_recv] mca_btl_tcp_frag_recv: readv failed: Connection reset by peer (104)
[rvandevaart@drossetti-ivy4 onesided]$
I wonder if the latest osc/pt2pt updates fixed this issue.
I just retested and this problem still exists.
Thanks for verifying that. I will see if I can reproduce the issue.
As discussed in http://www.open-mpi.org/community/lists/devel/2015/04/17288.php here is a patch for the test case:
diff --git a/ibm/onesided/c_accumulate.c b/ibm/onesided/c_accumulate.c
index 3ca39e4..b1596ae 100644
--- a/ibm/onesided/c_accumulate.c
+++ b/ibm/onesided/c_accumulate.c
@@ -30,6 +30,8 @@ main(int argc, char *argv[])
     SendBuff = rank + 100;
     RecvBuff = 0;
 
+    MPI_Barrier(MPI_COMM_WORLD);
+
     /* Accumulate to everyone, just for the heck of it */
     MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);
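For reference, the patched test pattern looks roughly like the following standalone program. This is a sketch only: the window setup and the accumulate loop are assumptions based on the test name and the failure output, not the actual ibm/onesided/c_accumulate.c source. With two ranks, each RecvBuff should end up as (0+100) + (1+100) = 201, matching the "expected 201" in the error message above.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, nprocs;
    int SendBuff, RecvBuff;
    MPI_Win Win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

    /* Expose RecvBuff through the window. */
    MPI_Win_create(&RecvBuff, sizeof(RecvBuff), sizeof(int),
                   MPI_INFO_NULL, MPI_COMM_WORLD, &Win);

    SendBuff = rank + 100;
    RecvBuff = 0;

    /* The patch: ensure every rank has finished its local store to
     * RecvBuff before any other rank's accumulate can land in the
     * window, since the NOPRECEDE fence may not synchronize. */
    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Win_fence(MPI_MODE_NOPRECEDE, Win);
    for (int i = 0; i < nprocs; i++) {
        MPI_Accumulate(&SendBuff, 1, MPI_INT, i, 0, 1, MPI_INT,
                       MPI_SUM, Win);
    }
    MPI_Win_fence(MPI_MODE_NOSUCCEED, Win);

    /* With np ranks, RecvBuff should hold sum over i of (i + 100). */
    printf("rank %d: RecvBuff = %d\n", rank, RecvBuff);

    MPI_Win_free(&Win);
    MPI_Finalize();
    return 0;
}
```

Build and run with, e.g., mpicc accum.c -o accum && mpirun -np 2 ./accum.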
I am waiting for confirmation that MPI_Win_fence does not imply MPI_Barrier before I push this to the ompi-tests/ibm repo.
I believe it was discussed on the mailing list a while ago that MPI_Win_fence does not imply a barrier. In this case it is just a no-op in osc/pt2pt.
If I am wrong I will be happy to update master and 1.8.
Yup. MPI-3.0 page 443 lines 44-47:
A fence call usually entails a barrier synchronization: a process completes a call to
MPI_WIN_FENCE only after all other processes in the group entered their matching call.
However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a
call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.
@ggouaillardet Go ahead and push the change to the test.
Can we consider this fixed?
I think so. Did you read Kawashima-san's comments on the devel ML? http://www.open-mpi.org/community/lists/devel/2015/04/17294.php
I think this is a bug in pt2pt OSC regarding MPI_MODE_NOPRECEDE, described in my mail.
The easiest fix will be:
diff --git a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
index e925747..16d9ebd 100644
--- a/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
+++ b/ompi/mca/osc/pt2pt/osc_pt2pt_active_target.c
@@ -124,6 +124,8 @@ ompi_osc_pt2pt_fence(int assert, ompi_win_t *win)
     if (0 != (assert & MPI_MODE_NOPRECEDE)) {
         OPAL_OUTPUT_VERBOSE((50, ompi_osc_base_framework.framework_output,
                              "osc pt2pt: fence end (short circuit)"));
+        ret = module->comm->c_coll.coll_barrier (module->comm,
+                                                 module->comm->c_coll.coll_barrier_module);
         return ret;
     }
I am ok with that change but I am still unsure about the interpretation. The problem with the test case is that it initializes a region in the window then claims there is no preceding operation but that is not the case. There is no preceding MPI RMA operation but the memory has been modified. I will see if I can clear this up with the RMA WG.
From the standard: "MPI_MODE_NOPRECEDE — the fence does not complete any sequence of locally issued RMA calls." It doesn't talk about any local accesses to the window, only about the locally issued RMA.
This is also clarified in the "Using Advanced MPI: Modern Features of the Message-Passing Interface" page 99 ( https://books.google.com/books?id=GY5IBQAAQBAJ&pg=PA99&lpg=PA99&dq=MPI_MODE_NOPRECEDE&source=bl&ots=lH0Wj9n0MO&sig=MhD1u9HjjGi8YKrzwTcMzd4JPUM&hl=en&sa=X&ei=ruY1Va2wBqa1sQSyuIH4Bw&ved=0CEkQ6AEwBg#v=onepage&q=MPI_MODE_NOPRECEDE&f=false )
Last but not least, MPICH implementation has a similar if case ( http://trac.mpich.org/projects/mpich/browser/src/mpid/pamid/src/onesided/mpid_win_fence.c?rev=658ff0d92533793bf453fc079e2b96cf576ac1f4 ).
George.
I agree with George, the code as-is is correct. Inserting a barrier just masks an underlying problem in the osc/rdma/tcp BTL stack.
@hjelmn any update?
From verbal discussion: this is an issue for v1.10.0 and v2.0.0.
@hjelmn do you think this is fixed? Or if it is not fixed, can you fix it for v2.0.0?
@hjelmn if you don't think this is a blocker for v2.0.0 please remove blocker label.
Well, there really is nothing to fix if we agree that the no-barrier implementation of MPI_MODE_NOPRECEDE is correct. As George pointed out MPICH has the same behavior as both the osc/pt2pt and osc/rdma components so I think it is safe to keep this behavior for now.
If anyone disagrees I can put a barrier in that path and bring it over with the rest of the RMA updates after add_procs is merged.
I agree that a barrier is not necessarily required in MPI_WIN_FENCE for the MPI_MODE_NOPRECEDE case. But my understanding is that some sort of synchronization is still required before starting succeeding RMA operations.

MPI_WIN_FENCE has four roles regarding access/exposure epochs. These are explained in MPI-3.1 page 441 lines 9-15.

MPI-3.1 has the two descriptions below regarding MPI_MODE_NOPRECEDE.

Page 441 lines 32-34:

MPI_MODE_NOPRECEDE - the fence does not complete any sequence of locally issued RMA calls. If this assertion is given by any process in the window group, then it must be given by all processes in the group.

Page 451 lines 17-20:

A fence call usually entails a barrier synchronization: a process completes a call to MPI_WIN_FENCE only after all other processes in the group entered their matching call. However, a call to MPI_WIN_FENCE that is known not to end any epoch (in particular, a call with assert equal to MPI_MODE_NOPRECEDE) does not necessarily act as a barrier.

MPI_MODE_NOPRECEDE only affects completing access/exposure epochs. If MPI_MODE_NOPRECEDE is specified for an MPI_WIN_FENCE call and no succeeding RMA communication calls will be issued until the next MPI_WIN_FENCE call, no synchronization (including a barrier) is required. But if succeeding RMA communication calls will be issued, the target memory must be accessed after the target rank starts the next exposure epoch. This is another rule, described in MPI-3.1 page 437 lines 10-11:

In active target communication, a target window can be accessed by RMA operations only within an exposure epoch.

To conform to this rule, a barrier is not necessarily required; synchronization between the origin and the target of the RMA communication call is sufficient. But inserting a barrier in MPI_WIN_FENCE is the easiest implementation. For pt2pt OSC, an implementation in which the origin sends a message via the PML without synchronization, and the target receives the message and loads from/stores to its window after the corresponding MPI_WIN_FENCE call, would also be OK.

I'm unsure whether my interpretation above is correct. If someone says it is incorrect, I'll follow them.

Regarding the MPICH implementation, MPI_WIN_FENCE in the MPICH source seems to have two implementations (ch3 and pamid). The source referred to by George is the pamid implementation. The ch3 implementation bypasses an ibarrier on line 520 only if both MPI_MODE_NOPRECEDE and MPI_MODE_NOSUCCEED are specified in assert. Though I don't know the MPICH source well, the ch3 one is generic and the pamid one is for IBM PAMI?
The following code shows my intended result (1) with MPICH master (ch3) but shows unintended result (0) with Open MPI master (pt2pt and rdma).
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, obuf = 0, tbuf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&tbuf, sizeof(tbuf), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    if (rank == 0) {
        sleep(1);
        tbuf = 1;
    }
    MPI_Win_fence(MPI_MODE_NOPRECEDE, win);
    if (rank == 1) {
        MPI_Get(&obuf, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
    if (rank == 1) {
        printf("%d\n", obuf);
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
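A minimal workaround sketch for that reproducer, under the interpretation that the application (not the library) must synchronize: pass assert 0 to the first fence so it is permitted to act as a barrier and to complete local stores, instead of MPI_MODE_NOPRECEDE. Only the marked line differs from the reproducer above; keeping MPI_MODE_NOPRECEDE and adding an explicit MPI_Barrier before the fence would be an equivalent fix.

```c
#include <stdio.h>
#include <unistd.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank, obuf = 0, tbuf = 0;
    MPI_Win win;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Win_create(&tbuf, sizeof(tbuf), 1, MPI_INFO_NULL,
                   MPI_COMM_WORLD, &win);
    if (rank == 0) {
        sleep(1);
        tbuf = 1;   /* local store into the exposed window memory */
    }
    /* Changed line: assert 0 instead of MPI_MODE_NOPRECEDE, so this
     * fence may synchronize and rank 0's store to tbuf is ordered
     * before rank 1's MPI_Get in the next epoch. */
    MPI_Win_fence(0, win);
    if (rank == 1) {
        MPI_Get(&obuf, 1, MPI_INT, 0, 0, 1, MPI_INT, win);
    }
    MPI_Win_fence(MPI_MODE_NOSUCCEED, win);
    if (rank == 1) {
        printf("%d\n", obuf);   /* should now reliably print 1 */
    }
    MPI_Win_free(&win);
    MPI_Finalize();
    return 0;
}
```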
This issue is fixed by 5b9c82a9648b06364b695e199711e1c26a3afeeb in master and open-mpi/ompi-release@0e8f2675bf13dd6aa34cbf4492f92a5cddcaaf6f in v2.0 and can be closed. Right? Thanks!
@hjelmn @sjeaugey @bosilca Do you agree that this can be closed?
Yeah. I added a barrier until I can ensure no operations are complete at the target until it reaches the fence.
The ibm/onesided/c_accumulate test fails when run over TCP BTL. I do not observe the failure over openib or sm BTL. I also do not observe it on 1.8. This is only on master.