Progress between RMA/P2P/Collectives

devreal commented 3 years ago

Since not everyone in the RMA WG participates in the Terms WG, where the discussion of progress rules will be handled, I wanted to get a feel for what people here think about how progress of RMA operations, and progress made by RMA synchronization calls, should be defined in the future. The current wording is not precise about which guarantees users can expect regarding progress of outstanding non-RMA communication operations when calling, for example, MPI_Win_test, or progress of RMA operations when calling non-RMA procedures.

Here are two examples:

Example 1: Progress of non-RMA operations in RMA calls:

if (rank == 0) {
  MPI_Request sreq, rreq;
  MPI_Isend(&large_buffer, &sreq);
  MPI_Rput(..., &rreq);
  while (!flag) MPI_Win_test(&flag);
  MPI_Wait(&sreq);
} else if (rank == 1) {
  MPI_Recv(&large_buffer);
  MPI_Win_complete(...);
}

Question: should MPI_Win_test guarantee progress of the send?

Example 2: Progress of RMA operations in non-RMA calls:

if (rank == 0) {
  MPI_Request areq, breq;
  MPI_Raccumulate(..., &areq); // assuming operation not supported by HW, may fall back to AM
  while (!flag) MPI_Test(&areq, &flag); // assuming we can do something useful in between
  MPI_Send(large_message);
} else if (rank == 1) {
  MPI_Request rreq;
  MPI_Irecv(large_message, ..., &rreq); // complete the barrier before completing the RMA epoch
  while (!flag) MPI_Test(&rreq, &flag); // assuming we can do something useful in between
}

Here, the only option I can see for completion is for the test on the receive request to progress the accumulate operations.

From a user perspective, I expect both programs to be correct since I am continuously calling into MPI, giving the implementation a chance to progress any outstanding operations that the operations I am polling on might depend on. Of course, the MPI implementation has no knowledge of such dependencies.

In previous discussions (https://github.com/mpi-forum/mpi-issues/issues/499) the point was raised that the RMA synchronization functions should not have to progress non-RMA operations, to avoid the added latency. That would render the first example incorrect.

The question now is: what expectations do people have in terms of progress inside RMA synchronization functions? I see the following options:

1) Guaranteed progress of any outstanding operation: a call to any RMA synchronization function has to ensure progress of outstanding non-RMA operations (if possible, of course). This may not be required on every call but the observable behavior should be that eventually all operations complete.

2) No progress guarantees in RMA: RMA and non-RMA progress is unidirectional: RMA synchronization operations are not required to progress non-RMA operations. Isolation in both directions is likely not feasible due to passive target RMA.

Example 3: Guaranteed progress

if (rank == 0) {
  int signal = 0;
  MPI_Request sreq;
  MPI_Isend(large_message, 1, &sreq);
  while (!signal) { // poll for a signal to be set by rank 1
    MPI_Get(&signal, myrank, ...); 
    MPI_Win_flush_local(myrank); // the get completes immediately
  }
  MPI_Wait(&sreq);
} else if (rank == 1) {
  int signal = 1;
  MPI_Recv(large_message, 0);
  MPI_Put(&signal, 0);
  MPI_Win_flush(0);
}

Without progress guarantees from MPI_Win_flush_local, the application would be required to test on sreq to ensure completion of the send. If the send and the get were issued by different user libraries the application would have to ensure that the progress dependencies are correctly handled (on top of the data dependencies of initiated operations). Similarly, what behavior is expected when instead of using MPI_Win_flush_local we use MPI_Rget+MPI_Wait? Do we expect progress of the send even though the local get completed immediately?

I can see the argument that RMA communication is especially latency sensitive and any additional progress may be costly. On the other hand, implementations may have some leeway to limit progress of non-RMA operations to every N'th call, reducing the impact on latency while ensuring eventual completion of communication dependencies.
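The "every N'th call" idea can be sketched with a small toy model in plain C. This is only an illustration and is not taken from any MPI implementation: the names toy_progress_t, toy_init, toy_rma_sync, and toy_calls_until_done are hypothetical, and pending_p2p just counts abstract units of outstanding non-RMA work.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model (hypothetical, not from any MPI library): an RMA synchronization
 * call that pokes the non-RMA progress engine only on every N-th invocation. */
typedef struct {
    unsigned interval;     /* N: progress non-RMA work on every N-th call */
    unsigned calls;        /* calls made since the last non-RMA progress */
    unsigned pending_p2p;  /* abstract units of outstanding non-RMA work */
} toy_progress_t;

void toy_init(toy_progress_t *p, unsigned interval, unsigned pending)
{
    assert(interval > 0);  /* N = 0 would mean "never", i.e. no guarantee */
    p->interval = interval;
    p->calls = 0;
    p->pending_p2p = pending;
}

/* Models one RMA synchronization call (for example a flush).
 * Returns true if this call also progressed non-RMA work. */
bool toy_rma_sync(toy_progress_t *p)
{
    /* The latency-critical RMA-only path would run here on every call. */
    if (++p->calls < p->interval)
        return false;      /* fast path: skip non-RMA progress */
    p->calls = 0;
    if (p->pending_p2p > 0)
        p->pending_p2p--;  /* slow path: advance one unit of non-RMA work */
    return true;
}

/* Counts how many synchronization calls it takes until all non-RMA work
 * is complete; finite for any N > 0, which is the "eventual completion". */
unsigned toy_calls_until_done(toy_progress_t *p)
{
    unsigned n = 0;
    while (p->pending_p2p > 0) {
        toy_rma_sync(p);
        n++;
    }
    return n;
}
```

In this model, a larger N keeps more calls on the fast path but delays completion of the non-RMA work proportionally: with N = 4 and three pending units, twelve synchronization calls are needed.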

In any case, the MPI standard should clearly outline what constitutes a correct program. Ignoring potential breakage of existing software, we could define the expectations from the RMA point of view either way and I would appreciate any input from the RMA working group :)

hjelmn commented 3 years ago

No. IMHO RMA calls should not be required to progress two-sided communication. The first program should be considered erroneous and should check the two-sided request as part of the loop.

Now, it is also my opinion that active message RMA should be removed from the standard. This issue should go away if both PSCW and fence go away.
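To make the dependency in Example 1 concrete, here is a toy model in plain C of rank 0's polling loop. It does not call MPI and does not describe any real implementation; sim_rank_t, sim_win_test, sim_test_send, and the two polling helpers are hypothetical stand-ins. The only assumption encoded is the one from the example: rank 1 reaches MPI_Win_complete only after its MPI_Recv returns, so the epoch cannot finish before the send has been progressed.

```c
#include <assert.h>
#include <stdbool.h>

/* Whether an RMA call also progresses two-sided communication. */
typedef enum { SIM_LINKED, SIM_ISOLATED } sim_policy_t;

typedef struct {
    sim_policy_t policy;
    int send_left;   /* progress steps until the large send completes */
    int epoch_left;  /* steps until the RMA epoch completes; these only
                        start counting once the send is done */
} sim_rank_t;

/* One unit of two-sided progress. */
static void sim_progress_send(sim_rank_t *r)
{
    if (r->send_left > 0)
        r->send_left--;
}

/* Stand-in for MPI_Win_test: always progresses RMA, and progresses the
 * send only under the SIM_LINKED policy. */
bool sim_win_test(sim_rank_t *r)
{
    if (r->policy == SIM_LINKED)
        sim_progress_send(r);
    if (r->send_left == 0 && r->epoch_left > 0)
        r->epoch_left--;
    return r->send_left == 0 && r->epoch_left == 0;
}

/* Stand-in for MPI_Test on the send request. */
bool sim_test_send(sim_rank_t *r)
{
    sim_progress_send(r);
    return r->send_left == 0;
}

/* The loop as written in Example 1: polls only the window. */
bool sim_poll_window_only(sim_rank_t r, int max_iters)
{
    assert(max_iters >= 0);
    for (int i = 0; i < max_iters; i++)
        if (sim_win_test(&r))
            return true;
    return false;  /* gave up: the epoch never completed */
}

/* The suggested variant: also check the two-sided request in the loop. */
bool sim_poll_both(sim_rank_t r, int max_iters)
{
    assert(max_iters >= 0);
    for (int i = 0; i < max_iters; i++) {
        sim_test_send(&r);
        if (sim_win_test(&r))
            return true;
    }
    return false;
}
```

Under SIM_ISOLATED, sim_poll_window_only never returns true no matter how large max_iters is, while sim_poll_both completes; under SIM_LINKED both loops complete. That is the difference between the two readings of the standard being discussed here.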

jeffhammond commented 3 years ago

I already gave my opinion here: https://github.com/mpi-forum/mpi-issues/issues/499#issuecomment-847381070

wgropp commented 3 years ago

I agree that RMA progress and other progress should not be linked. The intent has always been to let users exploit special hardware where possible (and reasonable). Assuming, whether explicitly or implicitly, that there is a single progress agent, can add significant and unnecessary overhead to all RMA operations.

Users can write correct programs without this stronger requirement. It may be that there are some programs that may run on some systems and not others. This has been true from the beginning - programs that assume message buffering will often run for short messages but deadlock for others. Calls to RMA routines may cause general progress but should not require that.

Bill

William Gropp Director, NCSA Thomas M. Siebel Chair in Computer Science University of Illinois Urbana-Champaign IEEE-CS President-Elect

jdinan commented 3 years ago

Hi Bill,

Is the following allowed to deadlock because MPI_Win_fence does not progress the Isend on process 0?

Process 0:
  MPI_Isend to process 1
  MPI_Win_fence

Process 1:
  MPI_Recv from process 0
  MPI_Win_fence

~Jim.

rsth commented 3 years ago

FWIW, the progress section of the RMA chapter says on pg 621, ln 5, of MPI-4: "MPI implementations must guarantee that a process makes progress on all enabled communications it participates in, while blocked on an MPI call. This is true for send-receive communication and applies to RMA communication as well."

jdinan commented 3 years ago

Thanks for finding this. We could introduce an assertion for RMA synchronization operations or an info assertion on the window to restrict the scope of progress to RMA operations. It would be helpful if someone could measure the overheads this can save.

wgropp commented 3 years ago

Note that this doesn’t say anything about MPI_WIN_FENCE making progress on pt-2-pt communication - it just says that progress happens. There are lots of ways to ensure that, including a separate progress thread for the different kinds of communication.

In general, we need to describe the semantics, not the implementation. Then assertions can be used to help the implementation make various tradeoffs, e.g., for systems with weak (but low latency) progress guarantees.

mpiwg-rma / rma-issues

Progress between RMA/P2P/Collectives #17