MPI Continuations Proposal

Note

The description below reflects an earlier version of the MPI Continuations proposal and is kept for historical purposes. The current version of the proposal can be found in https://github.com/mpiwg-hybrid/mpi-standard/pull/1 and in the following PDF: https://github.com/mpiwg-hybrid/mpi-standard/files/14565813/continuations_202403011.pdf

Background

MPI supports many kinds of non-blocking operations (pt2pt, collectives, RMA, I/O), each returning a request object that can be used to test and wait for the completion of the operation. Once an operation is complete, applications typically react to that change in state, e.g., by deallocating the buffer used by the operation, processing the received message, or starting subsequent operations. Polling on requests to detect completion is impractical for applications that are able to overlap communication with additional work, such as processing available tasks. Request management may also become cumbersome and error-prone, especially in multi-threaded applications.

Proposal

This proposal introduces a flexible interface for attaching so-called continuations to operation requests. Continuations are actions that are invoked by the MPI library once the completion of an operation is detected. At most one continuation may be attached to a request object at any time. When a continuation is attached to a non-persistent request, the MPI implementation takes back ownership of the request, and no copy of the request may be used to test/wait for the completion of the operation. Persistent requests remain valid and may still be used to test/wait for the operation to complete after the continuation has been attached, but no second continuation may be attached before the operation completes. It is unspecified whether the continuation has completed execution when a call to MPI_Test/MPI_Wait on a persistent operation request signals completion of the operation; execution of the continuation may be deferred to a later point.

Continuations may be attached to a single operation request (MPI_Continue) or a set of requests (MPI_Continueall):

typedef void MPI_Continue_cb_function(MPI_Status *array_of_statuses, void *cb_data);

int MPI_Continue(
  MPI_Request *op_request, 
  MPI_Continue_cb_function cb,
  void *cb_data,
  MPI_Status *status,
  MPI_Request cont_request);

int MPI_Continueall(
  int count,
  MPI_Request array_of_op_requests[],
  MPI_Continue_cb_function cb,
  void *cb_data,
  MPI_Status array_of_statuses[],
  MPI_Request cont_request);

The latter will cause the continuation to be invoked once all of the provided operations have completed. For each operation request, a status may be provided that will be set before the continuation is invoked. The provided buffer containing the status(es) will be passed to the continuation callback, along with the provided cb_data pointer. MPI_STATUS_IGNORE/MPI_STATUSES_IGNORE may be passed instead to the registration function, which would then be passed to the callback instead.
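
The "invoke once all operations have completed" rule can be illustrated with a small MPI-free model. This is only a sketch of the bookkeeping, not how any MPI library implements it: every toy_ name is hypothetical, toy_status stands in for MPI_Status, and a NULL status buffer plays the role of MPI_STATUSES_IGNORE.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for MPI_Status. */
typedef struct { int source; int tag; } toy_status;

/* Same shape as MPI_Continue_cb_function in the proposal. */
typedef void toy_cb_function(toy_status *array_of_statuses, void *cb_data);

/* One continuation attached to `count` operations (toy model). */
typedef struct {
    int              count;        /* number of operations attached */
    int              outstanding;  /* operations not yet complete */
    toy_cb_function *cb;
    void            *cb_data;
    toy_status      *statuses;     /* caller-provided buffer, NULL to ignore */
} toy_continuation;

/* Models MPI_Continueall: remember callback, user data and status buffer. */
void toy_continueall(toy_continuation *c, int count, toy_cb_function *cb,
                     void *cb_data, toy_status *statuses)
{
    assert(count > 0 && cb != NULL);
    c->count = count;
    c->outstanding = count;
    c->cb = cb;
    c->cb_data = cb_data;
    c->statuses = statuses;
}

/* Models the library detecting completion of operation `idx`.
 * Returns 1 if this completion triggered the callback, 0 otherwise. */
int toy_operation_completed(toy_continuation *c, int idx, toy_status st)
{
    assert(idx >= 0 && idx < c->count && c->outstanding > 0);
    if (c->statuses != NULL)
        c->statuses[idx] = st;        /* status is set before the callback runs */
    if (--c->outstanding > 0)
        return 0;                     /* still waiting for other operations */
    c->cb(c->statuses, c->cb_data);   /* all complete: invoke exactly once */
    return 1;
}

/* Example user data and callback: the callback is not told how many
 * statuses it receives, so the count travels inside cb_data. */
typedef struct { int count; int tag_sum; int calls; } toy_cb_state;

void toy_sum_tags(toy_status *statuses, void *cb_data)
{
    toy_cb_state *s = cb_data;
    s->calls++;
    for (int i = 0; statuses != NULL && i < s->count; i++)
        s->tag_sum += statuses[i].tag;
}
```

In this model the callback receives exactly the status buffer and cb_data pointer that were passed at registration, which is why toy_cb_state has to carry the request count itself.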

Continuation Requests

The continuation is attached to the operation request(s) and registered with the continuation request (cont_request above). Continuation requests are allocated using MPI_Continue_init:

int MPI_Continue_init(MPI_Info info, MPI_Request *cont_req);

Continuation requests accumulate outstanding continuations and can be used to test/wait for their completion. Continuation requests may themselves have a continuation attached to them, which will be invoked once all registered continuations have completed executing. They can also be used to progress outstanding continuations by calling MPI_Test on them.

Continuation requests are persistent but are not started explicitly. Instead, a continuation request is started implicitly when the first continuation is registered after initialization or after a previous completion.
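
The implicit start and reuse of a continuation request can be modeled as a tiny state machine. The sketch below is MPI-free and all toy_ names are hypothetical; it assumes that testing an inactive continuation request reports completion, as standard MPI does for inactive persistent requests.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of a continuation request. */
typedef struct {
    bool active;      /* set implicitly, never by an explicit start call */
    int  registered;  /* continuations registered and not yet executed */
} toy_cont_request;

/* Models MPI_Continue_init: the request starts out inactive. */
void toy_cont_request_init(toy_cont_request *r)
{
    r->active = false;
    r->registered = 0;
}

/* Models registering a continuation: the first registration after
 * initialization or after a previous completion starts the request. */
void toy_register(toy_cont_request *r)
{
    r->registered++;
    r->active = true;
}

/* Models the library having finished executing one continuation. */
void toy_continuation_executed(toy_cont_request *r)
{
    assert(r->registered > 0);
    r->registered--;
}

/* Models MPI_Test on the continuation request. */
bool toy_test(toy_cont_request *r)
{
    if (r->active && r->registered > 0)
        return false;       /* continuations still outstanding */
    r->active = false;      /* complete: inactive again and reusable */
    return true;
}
```

After toy_test reports completion, the next toy_register call makes the request active again without any explicit start.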

Execution Context

By default, continuations may be invoked by any application thread calling into the MPI library. Two info keys can be passed to MPI_Continue_init to restrict where continuations are executed:

"mpi_continue_poll_only": if set to "true", continuations are only invoked when MPI_Test or MPI_Wait is called on the continuation request with which the continuations are registered. (default: "false", i.e., the continuation may be executed at any time)
"mpi_continue_thread": may be "application" (only application threads may execute continuations) or "any" (any thread may execute continuations, incl. MPI progress threads, if available). (default: "application")

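One way to read the two execution-context keys is as a predicate that the library evaluates before running a continuation. The sketch below is an MPI-free model with hypothetical toy_ names. It assumes the two keys combine as independent restrictions that must both be satisfied, which the text above does not state explicitly.

```c
#include <assert.h>
#include <stdbool.h>

/* Values of "mpi_continue_thread". */
typedef enum { TOY_THREAD_APPLICATION, TOY_THREAD_ANY } toy_thread_policy;

/* Who is currently inside the MPI library (toy classification). */
typedef enum { TOY_CALLER_APPLICATION, TOY_CALLER_PROGRESS } toy_caller;

typedef struct {
    bool              poll_only;  /* "mpi_continue_poll_only" */
    toy_thread_policy thread;     /* "mpi_continue_thread" */
} toy_exec_info;

/* Defaults as listed above: poll_only "false", thread "application". */
toy_exec_info toy_exec_info_default(void)
{
    toy_exec_info info = { false, TOY_THREAD_APPLICATION };
    return info;
}

/* polling_cont_request: true if the caller is inside MPI_Test/MPI_Wait on
 * the continuation request the continuation is registered with. */
bool toy_may_execute(toy_exec_info info, toy_caller caller,
                     bool polling_cont_request)
{
    assert(caller == TOY_CALLER_APPLICATION || caller == TOY_CALLER_PROGRESS);
    if (info.poll_only && !polling_cont_request)
        return false;   /* only polling the continuation request may run it */
    if (info.thread == TOY_THREAD_APPLICATION && caller != TOY_CALLER_APPLICATION)
        return false;   /* progress threads are excluded */
    return true;
}
```

With the defaults, an application thread inside any MPI call may run a continuation while a progress thread may not.
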
Further Info Keys

"mpi_continue_enqueue_complete": if "true" and all operations are already complete when a continuation is attached to a set of requests, the continuation is enqueued for later execution (e.g., while polling on the continuation request). Otherwise, continuations may be executed immediately inside the call to MPI_Continue/MPI_Continueall if all operations were immediately complete. (default: "false")
"mpi_continue_max_poll": the maximum number of continuations to execute when polling (calling MPI_Test) on the continuation request. (default: "-1", i.e., as many as possible)
"mpi_continue_async_signal_safe": if "true", the continuation is async-signal-safe and may be called from within a signal handler. (default: "false")

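The interaction of "mpi_continue_enqueue_complete" and "mpi_continue_max_poll" can be sketched with a counter-based model. The code is MPI-free and all toy_ names are hypothetical. Where the text says a continuation "may" be executed immediately, the toy always does so, which is a simplification.

```c
#include <assert.h>
#include <stdbool.h>

typedef struct {
    int  pending;           /* continuations enqueued and ready to run */
    int  executed;          /* continuations run so far */
    int  max_poll;          /* "mpi_continue_max_poll": -1 means no limit */
    bool enqueue_complete;  /* "mpi_continue_enqueue_complete" */
} toy_cont_queue;

/* Models attaching a continuation whose operations are all already
 * complete. Returns true if it ran inside the attach call. */
bool toy_attach_completed(toy_cont_queue *q)
{
    if (q->enqueue_complete) {
        q->pending++;       /* deferred until the request is polled */
        return false;
    }
    q->executed++;          /* toy choice: always run immediately */
    return true;
}

/* Models one MPI_Test call on the continuation request.
 * Returns the number of continuations executed by this call. */
int toy_poll(toy_cont_queue *q)
{
    assert(q->max_poll >= -1);
    int n = q->pending;
    if (q->max_poll >= 0 && q->max_poll < n)
        n = q->max_poll;    /* cap the work done per poll */
    q->pending  -= n;
    q->executed += n;
    return n;
}
```

With max_poll set to 2 and five continuations enqueued, three polls are needed to drain the queue in this model.
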
Resources

The current PDF: mpi40-report-continuations.pdf

Proposal PR: TBD

Open Questions

A list of open questions (to be used to track discussions):

Integration with Sessions

How to connect continuation requests to a session? Do we need MPI_Session_continue_init?
Is it legal to register continuations for requests from different sessions with the same continuation request?

Status handling

Can we use the same callback signature for MPI_Continue and MPI_Continueall, given that one would be passed MPI_STATUS_IGNORE and the other MPI_STATUSES_IGNORE?

General

Should we pass the number of completed requests to the callback function?
Is there any use for something like MPI_Continueany (potentially more resource efficient by reusing the same data structure for several continuations) or MPI_Continuesome (what would the semantics be)?
Should the execution of a continuation be required for the completion of a persistent request?

After the discussion at the 01/26/2022 virtual meeting I've mulled over different options for how to integrate persistent continuations for persistent operations (i.e., continuations that remain attached to persistent operations after they've executed). The model I came up with retains the hidden property of continuations and provides semantics for moving requests from one continuation to another and for freeing a subset of requests without jeopardizing correctness (at least I like to believe so).

Here is the updated API:

int MPI_Continue(
  MPI_Request *op_req,
  MPI_Continue_cb_function cb,
  void *cb_data,
  int flags,
  MPI_Status *status,
  MPI_Request cont_req);
int MPI_Continueall(
  int count,
  MPI_Request op_req[],
  MPI_Continue_cb_function cb,
  void *cb_data,
  int flags,
  MPI_Status status[],
  MPI_Request cont_req);

Notice that MPI_Continue and MPI_Continueall now take a flags argument. The value is either 0 or an OR-combination of MPI_CONT_IMMEDIATE (the continuation may be executed immediately if all operations are complete; otherwise it is scheduled for later execution) and MPI_CONT_PERSISTENT. The second flag is the interesting part here and marks the continuation as persistent. If this flag is set, the continuation remains attached to each operation until

1) the request is freed; or
2) another continuation is attached to the request while the operation is inactive.

In the case of MPI_Continueall, if one request is removed from the persistent continuation, the continuation remains attached to all other requests. Once all requests have been freed or moved to another continuation, the continuation that was attached to them disappears and will never be executed again. While it is legal to attach the first continuation to an active request, it is erroneous to attach another continuation to a request while it is active (note that for non-persistent requests, it is always erroneous to attach another continuation because they are only observed in the active state).

Starting a persistent operation arms the continuation so that it will trigger once all relevant operations have armed it as well and completed. Attaching a different continuation to an active operation can therefore lead to a race condition and is erroneous.

A continuation may be executed once all requests tied to it

1) have transitioned from active to inactive; or
2) have been freed.

This ensures that a continuation is never executed while an operation tied to it is still active or has not been started at all (i.e., has not left the inactive state).
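
The execution rule above can be written as a predicate over the states of the requests tied to a continuation. The sketch is an MPI-free model with hypothetical toy_ names. It assumes, following the earlier statement that a continuation disappears once all of its requests have been freed, that a continuation with only freed requests is never reported as ready.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

typedef enum {
    TOY_REQ_NOT_STARTED,  /* inactive, has not armed the continuation */
    TOY_REQ_ACTIVE,       /* started, operation in flight */
    TOY_REQ_COMPLETED,    /* transitioned from active to inactive */
    TOY_REQ_FREED         /* request has been freed */
} toy_req_state;

/* Returns true if the continuation tied to these requests may execute. */
bool toy_continuation_ready(const toy_req_state states[], int count)
{
    assert(count >= 0 && (states != NULL || count == 0));
    int completed = 0;
    for (int i = 0; i < count; i++) {
        if (states[i] == TOY_REQ_NOT_STARTED || states[i] == TOY_REQ_ACTIVE)
            return false;   /* not armed yet, or still in flight */
        if (states[i] == TOY_REQ_COMPLETED)
            completed++;
    }
    /* Assumption: with every request freed, the continuation has
     * disappeared and must not run. */
    return completed > 0;
}
```

A mix of completed and freed requests is ready in this model, whereas a single active or never-started request blocks execution.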

Non-persistent continuations disappear once they have executed, and persistent operations will have no continuation attached to them afterwards. It is possible to attach persistent continuations to non-persistent operations. After the continuation has executed, the non-persistent requests will have been freed, leaving the persistent continuation attached to only the persistent operations in the set of operations (if any). If a persistent continuation was attached to only non-persistent operations, it behaves as if it were a non-persistent continuation (it disappears once all non-persistent operations have been freed).

It is not possible to reattach a continuation to a subset of operations. Instead, a new continuation should be created. The cost will be similar to trying to tie a subset of operations to an existing continuation. The solution above is cleaner as it does not explicitly expose continuation objects to the application space.
