Open regrant opened 4 years ago
Isn't this a requirement for any RDMA-based implementation?
@raffenet most RDMA implementations would arrange this in the setup/start, yes, but the semantics don't guarantee it: you could call all of the pready calls on the initiator before the remote buffer was available at all, and finish the buffer exchange and data payload in the wait/completion. This guarantee means an accelerator like a GPU can be confident that triggerable operations on a NIC can be directly triggered, and will succeed, from inside a kernel without knowledge of the state of the remote buffers in MPI (you know the state is guaranteed to be a certain one).
What I was asking is doesn't this benefit non-GPU use-cases? I don't see why this is specific to accelerators.
Yes, it's also an optimization for non-GPUs; the host CPU could benefit from this synchronization in some cases as well, but it does have the negative impact of reducing some opportunities for earlybird communication. So on the CPU side it's less clear that this is always better, as opposed to its use with a GPU, where in practice it is always going to be better due to the slow execution of the non-synchronized code path.
@regrant makes sense. Thanks for the explanation.
Yes, it's also an optimization for non-GPUs, the host CPU could benefit from this synchronization in some cases as well, but it does have the negative impact of reducing some opportunities for earlybird communication.
Actually, I'm still curious about this statement. The finepoints paper mentions using RDMA, but makes no mention of earlybird communication or any additional synchronization to ensure the target buffer is available. How did it work? Is the source code available?
What is the recv-side process's responsibility with regard to MPI_Psync? Does it have to make the same call? The proposed text seems to be send-side focused.
Sorry, the early work on finepoints used the term earlybird communication; here's an illustration of it. Basically you want to move data as it becomes available, which is possible with RDMA if you know your buffer is set up. If you don't, or if you want to aggregate partitions, you can easily hold back sending the ready portions of the local buffer and send them when the remote buffer is available or when you have the parts you want aggregated. The remote buffer being set up is probably something you want in practice (our prototype implementations guaranteed it, but that's not required).
Both sides need to call Psync; it's essentially an RTS-CTS exchange (though the recv side can send the CTS before receiving an RTS if one wanted to optimize it that way).
Text in Psync description: If the user wishes to synchronize before beginning to call \mpifunc{MPI_PREADY} calls, this call must be called by both the send-side and recv-side processes for a given partitioned communication operation.
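Put together with the quoted description, both sides' usage might look like the following C-style sketch. This is purely illustrative: MPI_Psync is the proposed call, so its signature here is an assumption, and partition counts/datatypes are placeholders.

```c
/* Send side -- sketch only, error handling omitted */
MPI_Request req;
MPI_Psend_init(buf, nparts, count, MPI_DOUBLE, dest, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);               /* local: RTS-CTS may still be in flight */
MPI_Psync(&req);               /* proposed call: complete the buffer exchange */
for (int i = 0; i < nparts; i++)
    MPI_Pready(i, req);        /* remote buffer now guaranteed available */
MPI_Wait(&req, MPI_STATUS_IGNORE);

/* Recv side -- must make the matching MPI_Psync call */
MPI_Precv_init(rbuf, nparts, count, MPI_DOUBLE, src, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);
MPI_Psync(&req);               /* may send the CTS eagerly, per the above */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```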
Just for my understanding, when you refer to buffer "setup" you mean an RTS-CTS exchange that happens as part of every MPI_Start call at the sender and receiver? And MPI_Psync is basically a way to explicitly complete this exchange before any subsequent MPI_Pready calls?
Yes, exactly. Since MPI_Start is defined as local, the explicit completion of the RTS-CTS exchange before MPI_Pready calls is helpful.
So I think what you are saying is that finepoints blocked in the first MPI_Pready call to ensure the receive buffer was available, but this branch+progress could be quite expensive for GPU kernels. MPI_Psync won't eliminate the branch, but it will make it always true and avoid the need for progress.
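In pseudocode, the device-side branch being discussed might look like this. The helper names and request fields are hypothetical, not part of any implementation mentioned in the thread:

```c
/* Hypothetical device-side MPI_Pready, sketch only */
void device_pready(int partition, preq_t *req)
{
    if (req->remote_buffer_ready) {
        nic_trigger_put(req, partition);   /* fast path: fire the RDMA directly */
    } else {
        defer_partition(req, partition);   /* slow path: hold the partition ... */
        drive_progress(req);               /* ... and drive progress from the GPU */
    }
}
/* After MPI_Psync, remote_buffer_ready is always true, so only the fast
 * path ever executes and no device-side progress engine is required. */
```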
BTW, isn't RTS-CTS more work than necessary? The sender only needs to know that the receiver is ready-to-receive to begin sending. Is there any point in letting the receiver know the sender is ready-to-send?
BTW, isn't RTS-CTS more work than necessary? The sender only needs to know that the receiver is ready-to-receive to begin sending. Is there any point in letting the receiver know the sender is ready-to-send?
Oh, I see you addressed this in your parenthetical above.
@raffenet exactly, you don't want to force progress engines to work on the GPU, so this gets around that and makes everything a lot simpler; it should be a lot faster too.
@raffenet On the first use of the partitioned operation, sender and receiver need to coordinate on number of partitions etc. This could require the round trip (RTS/CTS) that @regrant mentioned.
This is a great concept ...
Implementations will have to add support for these calls. This involves implementing a CTS-RTS type handshake for synchronization.
Can an implementation no-op the sync calls? If the implementation handles the general case in Pready, it doesn't really need to implement Psync.
I think we need to differentiate between what is required for correctness vs optimization.
@sayantansur there's no reason why an implementation couldn't no-op the sync calls, so putting them in should be safe in the end application. Note you're not required to call sync either; this is just for optimization.
Hi, we should revisit this ticket soon. Can we get it into MPI-4.2?
@patrick314 -- I added you. I think we both should push this forward.
To pick up this discussion, my view is that there are two separate issues, and I wonder if it would be better to allow them to be treated separately:
Obviously (1) subsumes (2), but (1) would have to be done prior to every call to Start, and while it's a potentially helpful optimization, I'm still not convinced it's necessary. In addition, the desire for these isn't unique to partitioned communication: all the various stream-triggered communication proposals (MPICH, HPE Two-sided, MPI-ACX enqueuing, etc.) would want this functionality to push their data movement to the one-sided path, too, right?
Is what we want something like MPI_Match(&request)/MPI_Imatch(&request, &matchrequest) for the former and MPI_Prepare(&request)/MPI_Iprepare(&request, &preparerequest) for the latter? These would take an outstanding two-sided MPI request and cover (1) and (2) above, respectively.
Problem
This ticket introduces MPI_Pbuf_prepare, a call that guarantees remote buffers are available before MPI_Pready is called for partitioned communication. This is important for optimizations to the MPI_Pready call that can be implemented on accelerators like GPUs/FPGAs.
Proposal
Introduce MPI_Pbuf_prepare and MPI_Pbuf_prepareall, which provide remote-buffer readiness guarantees from MPI. This enables a GPU/accelerator-side MPI implementation of MPI_Pready with a single code path, which is ideal for those architectures. MPI_Pbuf_prepare allows the MPI library to efficiently use accelerator-triggered communications that are set up on the host CPU for kernel-triggered communication. By avoiding buffer management and branching code paths, MPI_Pready and MPI_Parrived can be implemented using fast instructions on data-flow-centric architectures.
The proposed operation flow is as follows:
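A sketch of the intended flow, based on the discussion above (C-style pseudocode; the MPI_Pbuf_prepare signature and the kernel-launch helper are assumptions, not part of the proposal text):

```c
/* Host-side setup -- sketch only, error handling omitted */
MPI_Request req;
MPI_Psend_init(buf, nparts, count, MPI_DOUBLE, dest, tag,
               MPI_COMM_WORLD, MPI_INFO_NULL, &req);
MPI_Start(&req);            /* local; the buffer exchange may be incomplete */
MPI_Pbuf_prepare(&req);     /* returns once the remote buffer is available */
launch_kernel(req);         /* hypothetical: hand the request to the GPU */

/* Device side: each block marks its partition ready; with the buffer
 * guarantee, MPI_Pready can trigger the NIC with no branch or progress. */
MPI_Pready(my_partition, req);

/* Host-side completion */
MPI_Wait(&req, MPI_STATUS_IGNORE);
```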
Changes to the Text
This ticket adds two calls to the partitioned communication chapter: MPI_Pbuf_prepare and MPI_Pbuf_prepareall.
Impact on Implementations
Implementations will have to add support for these calls. This involves implementing an RTS-CTS type handshake for synchronization.
Impact on Users
Users will have a new mechanism that helps when writing MPI_Pready code for accelerators, providing consistently optimized performance.
References
Pull request Synchronization on Partitioned Communication for Accelerator Optimization
Semantics table pull request
Please see only the changes for this ticket, to avoid pending partitioned communication merges.