mpiwg-hybrid / hybrid-issues

Repository to discuss internal hybrid working group issues

Stream and Graph Based MPI Operations #5

Open jdinan opened 2 years ago

jdinan commented 2 years ago

What

Support for enqueueing MPI operations into accelerator work queues (streams) and compute graphs.

Why

Integrating communication with the computation scheduling model for accelerators improves the programming model, can improve communication/computation overlap, and reduces overheads.

Implementations

Slides

Papers

patrickb314 commented 2 years ago

Looking at the first set of slides, when MPI_Start_enqueue is called on a two-sided operation, when does the match occur, particularly with respect to other issued two-sided operations? When it's enqueued or when it's executed by the associated stream?

E.g., if a sender calls:

MPI_Send_init(&req[0]);
MPI_Start_enqueue(&req[0], stream);
MPI_Isend(&req[1]);

Are the sends guaranteed to be matched in the order 0 then 1 (since start_enqueue happened before the isend), or could they be matched in the other order if the match doesn't happen until the stream actually executes the enqueued start?

jdinan commented 2 years ago

Matching would occur based on when the operation gets executed by the stream.

I think of streams as being similar to threads. In this case, one thread (the stream) will start the persistent send while the main process starts the Isend. Since there's no synchronization between the two, the matching order is nondeterministic.

patrickb314 commented 2 years ago

Let me ask a slightly different question, then. Is MPI allowed to start matching when Start_enqueue is called, or does it have to wait until the stream triggers the progress engine to start matching? Is there any way the application could observe if MPI started the match early? Obviously MPI has to wait to move data in stream order, but it could be useful for implementations if the match could happen sooner.

The main challenge in doing this would be that Start_enqueue is a local operation, and matching is a non-local operation, so you couldn't block in Start_enqueue to do the match.

jdinan commented 2 years ago

Taking the thread analogy, the start_enqueue operation is simply putting a work descriptor into a queue, not performing an MPI operation in the traditional sense. Therefore, I would expect that MPI shouldn't start matching an enqueued operation until it's actually executed by whatever is taking work out of the queue and executing it. An exception to this would be partitioned operations, which match as soon as the persistent operation is initialized.

patrickb314 commented 2 years ago

The reason I'm asking is that if MPI can optimistically match when Start_enqueue is called, it can potentially separate the non-local matching portion of the enqueued operation from on-stream one-sided data movement.

This really comes down to how much ambiguity is in the standard. If the application cannot observe if MPI started matching when Start_enqueue is called, then MPI could match optimistically and enqueue the appropriate one-sided get or put on the stream, taking matching out of the stream's critical path. If the application can observe if MPI started the match early, however, then this API implicitly requires the stream to either perform matching itself or to synchronize with something else which performs the matching.

EDIT: spelling corrections, removed incorrect thoughts on psend matching

jdinan commented 2 years ago

I think this comes back to the discussion on "logically concurrent" ambiguity in the standard. I made an analogy between enqueued operations and threads. If we strengthened that to a semantic (i.e. each queue has the MPI semantics of a thread), then it would be subject to the "logically concurrent" discussion that @Wee-Free-Scot has been leading. Should we land on the side that "logically concurrent" means that the application can't enforce a specific ordering (e.g. by synchronization between queued operations using as an example CUDA events between streams or the host CPU), then I think we could make optimizations like you described.

To answer your last question -- with CUDA streams you can create an ordering across streams and with the host CPU using CUDA events. So, it would be possible for an application to observe that operations didn't match in the order that they attempted to create using such synchronization. I would expect the MPI communicator to still be the serialization point for matching when the same communicator is used across streams (ignoring relaxations possible with info keys). We could introduce new info assertions to optimize matching for queued operations. An interesting difference between queues and threads is that the MPI library can actually see queues because they're explicit in the API.

hzhou commented 2 years ago

Looking at the first set of slides, when MPI_Start_enqueue is called on a two-sided operation, when does the match occur, particularly with respect to other issued two-sided operations? When it's enqueued or when it's executed by the associated stream?

E.g., if a sender calls:

MPI_Send_init(&req[0]);
MPI_Start_enqueue(&req[0], stream);
MPI_Isend(&req[1]);

Are the sends guaranteed to be matched in the order 0 then 1 (since start_enqueue happened before the isend), or could they be matched in the other order if the match doesn't happen until the stream actually executes the enqueued start?

Assuming all the above calls are issued on the same stream (the same communicator with a serial context), mixing immediate operations between an "enqueue" operation and the next stream "synchronization" is undefined or illegal. Such usage essentially breaks the "serial" context of the stream.

An "immediate" operation is essentially an "enqueue" followed by an immediate "synchronization". This interpretation allows the "undefined" scenario: the usage is possible, with a reasonable but likely counter-intuitive outcome.

Wee-Free-Scot commented 2 years ago

Looking at the first set of slides, when MPI_Start_enqueue is called on a two-sided operation, when does the match occur, particularly with respect to other issued two-sided operations? When it's enqueued or when it's executed by the associated stream? E.g., if a sender calls:

MPI_Send_init(&req[0]);
MPI_Start_enqueue(&req[0], stream);
MPI_Isend(&req[1]);

Are the sends guaranteed to be matched in the order 0 then 1 (since start_enqueue happened before the isend), or could they be matched in the other order if the match doesn't happen until the stream actually executes the enqueued start?

Assuming all the above calls are issued on the same stream (the same communicator with a serial context), mixing immediate operations between an "enqueue" operation and the next stream "synchronization" is undefined or illegal. Such usage essentially breaks the "serial" context of the stream.

An "immediate" operation is essentially an "enqueue" followed by an immediate "synchronization". This interpretation allows the "undefined" scenario: the usage is possible, with a reasonable but likely counter-intuitive outcome.

Personally, I see this as sufficiently similar to forking a separate CPU thread to be able to reason about that analogous situation and draw conclusions that are valid for the enqueue situation.

What I mean is -- calling MPI_Start_enqueue is sufficiently similar to "start a pthread with MPI_Start as the function pointer (or, equivalently, a simple wrapper user function around MPI_Start)" that we can substitute that code and expect the same behaviour. This substitution replaces a proposed function, with unproven/not-yet-well-defined semantics, with an extant function with well-known behaviour and outcome.

/* wrapper gives MPI_Start the signature pthread_create expects */
void *start_wrapper(void *arg) { MPI_Start((MPI_Request *)arg); return NULL; }

MPI_Send_init(&req[0]);
pthread_create(&thread, NULL, start_wrapper, &req[0]);
MPI_Isend(&req[1]);

Comments:

hzhou commented 2 years ago

  • We should strive to avoid this reliance on the interpretation of "logically concurrent" (as far as possible) in the definition of new interfaces.

Agreed! We started to play around with this new idea called "MPI Stream" -- https://github.com/pmodels/mpich/discussions/5908. @Wee-Free-Scot It certainly can use some of your early input.

patrickb314 commented 2 years ago

Clarification - the match order for persistent sends is defined by when Send_init is called, not Start (or start enqueue) is called so perhaps there is no ambiguity in the matching order in the example I gave. This was a misunderstanding on my part.


Wee-Free-Scot commented 2 years ago

the match order for persistent sends is defined by when Send_init is called, not Start (or start enqueue) is called

This is only true for partitioned operations. Regular persistent point-to-point has always matched anew each time it is active, conceptually during the starting stage (MPI_Start or MPI_Startall), although in practice the protocol messages might happen any time before the associated completion stage. Partitioned point-to-point, in contrast, matches once, conceptually during the initialisation stage (MPI_Psend_init), although in practice the protocol messages might happen any time before the first completion stage.

MPI-4.0 p107 lines 21-23 (comparison of partitioned and regular persistent point-to-point match order):

Advice to implementors. Unlike MPI_SEND_INIT, MPI_PSEND_INIT can be matched as early as the initialization call.

MPI-4.0 p107 lines 13-15 (partitioned point-to-point match order defined by initialisation procedure order):

In the event that the communicator, tag, and source do not uniquely identify a message, the order in which partitioned communication initialization calls are made is the order in which they will eventually match.

MPI-4.0 p101 line 21 (regular persistent point-to-point must be started to permit matching):

A send operation started with MPI_START can be matched ...

MPI-4.0 p94 lines 24-27 (regular persistent point-to-point forms a half-channel):

In the case of point-to-point communication, the persistent communication request thus created can be thought of as a communication port or a “half-channel.” It does not provide the full functionality of a conventional channel, since there is no binding of the send port to the receive port.

Interestingly, I can find no slam-dunk quote from MPI-4.0 stating that regular persistent point-to-point matching order is determined by the starting procedure order. This is implied by the "half-channel" statement and by the "started with MPI_START" statement (quoted above) but there is no equivalent to the ordering statement made for nonblocking point-to-point (see below).

MPI-4.0 p74 lines 40-42:

Nonblocking communication operations are ordered according to the execution order of the calls that initiate the communication.

Side-note: we should fix this omission -- we should add a new subsubsection "3.9.1 Semantics of Persistent Communications" and state explicitly the tribal knowledge of the semantic rules pertaining to these operations.

Wee-Free-Scot commented 2 years ago

  • We should strive to avoid this reliance on the interpretation of "logically concurrent" (as far as possible) in the definition of new interfaces.

Agreed! We started to play around with this new idea called "MPI Stream" -- pmodels/mpich#5908. @Wee-Free-Scot It certainly can use some of your early input.

I have responded with some initial thoughts on the linked issue. Thanks for taking the time to write up your idea clearly.

patrickb314 commented 2 years ago

MPI-4.0 p74 lines 40-42:

Nonblocking communication operations are ordered according to the execution order of the calls that initiate the communication.

Thanks, Dan, this is the source of my confusion. I had initially thought matching was in Start order, read the partitioned spec which explicitly states that they match in order, wanted to make sure about regular persistent requests, read carefully about the state of MPI requests, and then saw the various quotes you provided.

From a standards terminology perspective, my uncertainty stems from:

  1. Whether matching of point-to-point operations happens in the order of request initialization or of starting (p15 terminology).
  2. Whether inactive requests made by persistent calls are "initializing" in the page 16 sense.
  3. Whether partitioned init requests are actually "initializing" or "initiating".

This quote from p.71 seems relevant to point 2:

A persistent communication request and the handle to it are inactive if the request is not associated with any ongoing communication (see Section 3.9).

But this could just be that I'm still not fully familiar with all of the relevant terminology and latest abstractions in the standard yet. If so, please excuse me while I get up to speed with the standard. It's been many years since I waded into it in depth, and lots has changed.

Wee-Free-Scot commented 2 years ago

I think some confusion is arising from trying to determine whether MPI requests are (or exhibit) operation stages, which they are (do) not.

The initialisation stage of an MPI operation creates an MPI request that represents that operation. The MPI operation is inactive (because the starting stage has not been done yet). We call the request an inactive request because it is a request that represents an inactive operation.

The starting stage of an MPI operation changes the state of the operation from inactive to active. Any request that represents this operation is now called an active request because it is a request that represents an active operation.

The completion stage of an MPI operation changes the state of the operation from active to inactive. Any request that represents this operation is now called an inactive request because it is a request that represents an inactive operation.

The freeing stage of an MPI operation deallocates/destroys the request that represents the operation.

This can be gleaned (we hope) from the state transition diagrams provided in the Terms chapter (see MPI-4.0 §2.4.1).

There is no API to discover-without-permitting-change the stage of the operation represented by a request.

Thus,

  1. Matching for regular point-to-point operations (blocking, nonblocking, or persistent) happens once per starting stage of the operation. For blocking operations, this means a match occurs for every call to the single procedure that expresses that operation, e.g. MPI_Send. For nonblocking operations, this means a match occurs for every call to the single initiation procedure for that operation, e.g. MPI_Isend. For persistent operations, this means a match occurs for every call to the starting procedure for that operation, e.g. MPI_Start.

    • Matching for partitioned point-to-point operations (persistent only) happens once per initialisation stage of the operation. This means that a match occurs for every call to the initialisation procedure for that operation, e.g. MPI_Psend_init.
    • I am being very careful with the wording here -- a match occurs, permitting us to count how many receive operations are needed without bringing the ordering rule(s) into our discussion. The matching order (if any) is determined (when it is defined at all) by the order of the procedure calls that give rise to the match. This is somewhat orthogonal, although very important for programmability/predictability/performance/etc.
  2. Persistent requests (of all types) are created inactive by the initialisation procedure, which performs the initialisation stage of the operation. No requests are initialising (category error).

  3. Partitioned communication operations are never initiated, they are only ever initialised and, perhaps, later/separately started (the two component parts of initiated) because they are persistent operations, rather than nonblocking operations. No requests are initialising or initiating (category error).

patrickb314 commented 2 years ago

The distinction between the state of MPI requests and the state of MPI operations was indeed my main source of confusion - I realized that right before your message came in, but thank you for the clarification. That said, I'm not sure the standard is clear on whether matching happens as part of initializing an operation (when all information needed for matching is available), or starting an operation (when the data buffers become available).

Would not this formulation also capture what the standard is trying to do?

  1. MPI point-to-point communications are matched when the operation is initialized, which may or may not be the same as when the request is initialized.
  2. MPI non-partitioned persistent init calls don't initialize an MPI operation because the request is inactive; they only initialize a request object. As a result, non-partitioned init calls do not result in the matching of an operation.
  3. MPI partitioned persistent init calls initialize an MPI operation (as well as a request object). As a result, partitioned init calls do result in the matching of an MPI operation.

That is, the standard is ambiguous on whether MPI operations match when they are initialized (and all necessary information is available to match them) or when they are starting. This could be resolved, including the distinction between non-partitioned and partitioned communications, either by:

  1. Having matching occur at different points in the lifecycle of MPI operations depending on whether or not they are partitioned (your proposal).
  2. Having matching always occur at operation initialization, but clarifying that non-partitioned init calls do not initiate an MPI communication, only an (inactive) MPI request (my alternative above).

Wee-Free-Scot commented 2 years ago

  1. An MPI request is a representation of an MPI operation; the state of a request and the state-changes of a request are tied to the state and the state-changes of the represented operation. An MPI request cannot be initialised without performing the initialisation stage of the MPI operation that it represents. Divorcing these two does not seem to be helpful or necessary.

  2. The fact that an MPI request exists means that (at least) the initialisation stage of the represented MPI operation has been performed and that the freeing stage of that MPI operation has not been performed. An MPI request is inactive precisely because (and exactly when) the represented MPI operation is inactive.

  3. Correct as stated.

The standard is not as clear as I would like (okay, that means ambiguous, doesn't it) about whether the matching order for persistent point-to-point operations is determined by their initialisation procedure calls or their starting procedure calls. It is, however, common knowledge that the latter is the correct interpretation; the former interpretation would cause a great deal of surprise to all users and implementors.

  1. It is currently the case that the ordering for different types of MPI operation depends on different stages within the lifecycle of those operations (and, therefore, depend on different MPI procedure calls for different types of operation).

  2. Personally, I wish that regular persistent point-to-point had been defined such that the matching order depended on the initialisation stage of the operation and not on the starting stage. This would, IMHO, make regular persistent point-to-point operations more useful and would permit a greater range of optimisation for these operations. All other types of persistent operations work that way (match order depends on initialisation stage), so regular point-to-point is the odd-one-out. This is not currently the case, however, and the MPI Forum is unlikely to vote for something that would involve such a huge surprise to existing users and implementors.