Open regrant opened 4 years ago
Isn't this a requirement for any RDMA-based implementation?
@raffenet most RDMA implementations would arrange this in the setup/start, yes, but the semantics don't guarantee it: you could call all of the pready calls on the initiator before the remote buffer was available at all, and finish the buffer exchange and data payload in the wait/completion. This guarantee means an accelerator like a GPU can be confident that triggerable operations on a NIC can be directly triggered, and will succeed, from inside a kernel without knowledge of the state of the remote buffers in MPI (you know the state is guaranteed to be a certain one).
What I was asking is doesn't this benefit non-GPU use-cases? I don't see why this is specific to accelerators.
Yes, it's also an optimization for non-GPUs; the host CPU could benefit from this synchronization in some cases as well, but it does have the negative impact of reducing some opportunities for earlybird communication. So on the CPU side it's less clear that this is always better, as opposed to its use with a GPU, where in practice it is always going to be better due to the slow execution of the non-synchronized code path.
@regrant makes sense. Thanks for the explanation.
Yes, it's also an optimization for non-GPUs, the host CPU could benefit from this synchronization in some cases as well, but it does have the negative impact of reducing some opportunities for earlybird communication.
Actually, I'm still curious about this statement. The finepoints paper mentions using RDMA, but makes no mention of earlybird communication or any additional synchronization to ensure the target buffer is available. How did it work? Is the source code available?
What is the recv-side process's responsibility with regard to MPI_Psync? Does it have to make the same call? The proposed text seems to be send-side focused.
Sorry, the early work on finepoints used the term earlybird communication; here's an illustration of it. Basically you want to move data as it becomes available, which is possible with RDMA if you know your buffer is set up. If you don't, or if you want to aggregate partitions, you can easily hold back sending the ready portions of the local buffer and send them when the remote buffer is available or when you have the parts you want aggregated. The remote buffer being set up is probably something you want in practice (our prototype implementations guaranteed it, but that's not required).
Both sides need to call Psync; it's essentially an RTS-CTS exchange (though the recv side can send the CTS before receiving an RTS if one wanted to optimize it that way).
Text in Psync description: If the user wishes to synchronize before beginning to call \mpifunc{MPI_PREADY} calls, this call must be called by both the send-side and recv-side processes for a given partitioned communication operation.
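Put together with the quoted description, both sides' usage might look like the following C-style sketch. This is purely illustrative: MPI_Psync is the proposed call, so its signature here is an assumption, and partition counts/datatypes are placeholders.

```c
/* Send side -- sketch only, error handling omitted */
MPI_Request req;
MPI_Psend_init(buf, nparts, count, MPI_DOUBLE, dest, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);               /* local: RTS-CTS may still be in flight */
MPI_Psync(&req);               /* proposed call: complete the buffer exchange */
for (int i = 0; i < nparts; i++)
    MPI_Pready(i, req);        /* remote buffer now guaranteed available */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* Recv side -- must make the matching MPI_Psync call */
MPI_Precv_init(rbuf, nparts, count, MPI_DOUBLE, src, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);
MPI_Psync(&req);               /* may send the CTS eagerly, per the above */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```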
Just for my understanding, when you refer to buffer "setup" you mean an RTS-CTS exchange that happens as part of every MPI_Start call at the sender and receiver? And MPI_Psync is basically a way to explicitly complete this exchange before any subsequent MPI_Pready calls?
Yes, exactly. Since MPI_Start is defined as local, the explicit completion of the RTS-CTS exchange before MPI_Pready calls is helpful.
So I think what you are saying is that finepoints blocked in the first MPI_Pready call to ensure the receive buffer was available, but this branch+progress could be quite expensive for GPU kernels. MPI_Psync won't eliminate the branch, but it will make it always true and avoid the need for progress.
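In pseudocode, the device-side branch being discussed might look like this. The helper names and request fields are hypothetical, not part of any implementation mentioned in the thread:

```c
/* Hypothetical device-side MPI_Pready, sketch only */
void device_pready(int partition, preq_t *req)
{
    if (req->remote_buffer_ready) {
        nic_trigger_put(req, partition);   /* fast path: fire the RDMA directly */
    } else {
        defer_partition(req, partition);   /* slow path: hold the partition ... */
        drive_progress(req);               /* ... and drive progress from the GPU */
    }
}
/* After MPI_Psync, remote_buffer_ready is always true, so only the fast
 * path ever executes and no device-side progress engine is required. */
```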
BTW, isn't RTS-CTS more work than necessary? The sender only needs to know that the receiver is ready-to-receive to begin sending. Is there any point in letting the receiver know the sender is ready-to-send?
BTW, isn't RTS-CTS more work than necessary? The sender only needs to know that the receiver is ready-to-receive to begin sending. Is there any point in letting the receiver know the sender is ready-to-send?
Oh, I see you addressed this in your parenthetical above.
@raffenet exactly, you don't want to force progress engines to work on the GPU, so this gets around that and makes everything a lot simpler; it should be a lot faster too.
@raffenet On the first use of the partitioned operation, sender and receiver need to coordinate on number of partitions etc. This could require the round trip (RTS/CTS) that @regrant mentioned.
This is a great concept ...
Implementations will have to add support for these calls. This involves implementing a CTS-RTS type handshake for synchronization.
Can an implementation no-op the sync calls? If the implementation handles the general case in Pready, it doesn't really need to implement Psync.
I think we need to differentiate between what is required for correctness vs optimization.
@sayantansur there's no reason why an implementation couldn't no-op the sync calls, so putting them in should be safe in the end application. Note you're not required to call sync either; this is just for optimization.
Hi, we should revisit this ticket soon. Can we get it into MPI-4.2?
@patrick314 -- I added you. I think we both should push this forward.
To pick up this discussion, my view is that there are two separate issues, and I wonder if it would be better to allow them to be treated separately:
Obviously (1) subsumes (2), but (1) would have to be done prior to every call to Start, and while it's a potentially helpful optimization, I'm still not convinced it's necessary. In addition, the desire for these isn't unique to partitioned communication: all the various stream-triggered communication proposals (MPICH, HPE Two-sided, MPI-ACX enqueuing, etc.) would want this functionality to push their data movement to the one-sided path, too, right?
Is what we want something like MPI_Match(&request)/MPI_Imatch(&request, &matchrequest) for the former and MPI_Prepare(&request)/MPI_Iprepare(&request, &preparerequest) for the latter? These would take an outstanding two-sided MPI request and cover (1) and (2) above, respectively.
Problem
This ticket introduces MPI_Pbuf_prepare, a call that guarantees remote buffers are available before MPI_Pready is called for partitioned communication. This is important for optimizations to the MPI_Pready call that can be implemented on accelerators like GPUs/FPGAs.
Proposal
Introduce MPI_Pbuf_prepare and MPI_Pbuf_prepareall, which provide remote-buffer readiness guarantees from MPI. This enables a GPU/accelerator-side MPI implementation of MPI_Pready with a single code path, which is ideal for those architectures. MPI_Pbuf_prepare allows the MPI library to efficiently use accelerator-triggered communications that are set up on the host CPU for kernel-triggered communication. By avoiding buffer management and branching code paths, MPI_Pready and MPI_Parrived can be implemented using fast instructions on data-flow-centric architectures.
The proposed operation flow is as follows:
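A sketch of the intended flow, based on the discussion above (C-style pseudocode; the MPI_Pbuf_prepare signature and the kernel-launch helper are assumptions, not part of the proposal text):

```c
/* Host-side setup -- sketch only, error handling omitted */
MPI_Request req;
MPI_Psend_init(buf, nparts, count, MPI_DOUBLE, dest, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);            /* local; the buffer exchange may be incomplete */
MPI_Pbuf_prepare(&req);     /* returns once the remote buffer is available */
launch_kernel(req);         /* hypothetical: hand the request to the GPU */

/* Device side: each block marks its partition ready; with the buffer
 * guarantee, MPI_Pready can trigger the NIC with no branch or progress. */
MPI_Pready(my_partition, req);

/* Host-side completion */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```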
Changes to the Text
This ticket adds two calls to the partitioned communication chapter: MPI_Pbuf_prepare and MPI_Pbuf_prepareall.
Impact on Implementations
Implementations will have to add support for these calls. This involves implementing an RTS-CTS type handshake for synchronization.
Impact on Users
Users will have a new mechanism that helps when writing MPI_Pready code for accelerators, providing consistently optimized performance.
References
Pull request Synchronization on Partitioned Communication for Accelerator Optimization
Semantics table pull request
Please see only the changes for this ticket, to avoid pending partitioned communication merges.