restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.
https://docs.restate.dev
Other
1.43k stars 34 forks source link

Investigate replacing shuffle with global total order subscription #1835

Open tillrohrmann opened 1 month ago

tillrohrmann commented 1 month ago

Currently, we are using the shuffle to send the following messages across partitions:

The ServiceInvocation message originates from a Call, OneWayCall journal entry or a Command::ProxyThrough command.

The InvocationResponse message originates from CompletePromise, CompleteAwakeable journal entries and End invoker effects.

The InvocationTermination message is created when cancelling/killing of invocations is being propagated to the children.

For a few of these messages we should be able to avoid sending them through the outbox. However for some others, I think that we still need the outbox (e.g. for the InvocationTermination message or the ProxyThrough command).

tillrohrmann commented 3 weeks ago

When applying the InvocationTermination message we scan through the journal of the invocation to detect child invocations that also need to be terminated. If child invocations exist, then the applying partition processor needs to send an InvocationTermination message to the partition processor that owns it. This happens as the pp is applying the original InvocationTermination message. Consequently, the PP either must wait until the child InvocationTermination messages have been committed to Bifrost (durably stored) or sending of these messages must go through the Outbox/shuffle.

In general, one can say that the partition processor needs the Outbox/shuffle for any message that it needs to send to another partition as a result of applying a Command unless we want to block the partition processor loop until these messages have been committed to Bifrost.

However, for messages where we already know that they are of interest to other partition processors before writing them to Bifrost (e.g. JournalEntry::Call where the chosen InvocationId defines the target partition processor), we can attach the respective Keys when writing them so that no additional Outbox/shuffle step is needed.

For messages that can have dynamically changing subscribers/recipients like the JournalEntry::Output which normally results in an InvocationResponse, it is a bit more tricky since at proposal time we might have a different set of subscribers than at application time. Therefore, we need some additional information attached to the JournalEntry::Output telling the pp to which subscribers (ResponseSink) the message was addressed so that the pp can send it to those that haven't received it yet (subscribers that registered after proposing the JournalEntry::Output).