Problem

With Restate's current model where we execute calls to a service instance sequentially and allow a service to await other services while keeping the lock, users can build applications that deadlock. Examples are the following:

A service that awaits on a call to itself directly or indirectly (via a chain of other service calls)
Two services A and B which call each other (directly or indirectly) on the same key and await the result

The underlying problem is that we allow awaiting on a future result while holding the lock of a service instance.

Solution ideas

The current solution ideas fall into two categories: deadlock detection and prevention.

Deadlock detection

The idea would be to detect cyclic dependencies between service instances and to notify the user about it. Since a cyclic dependency can be formed by two service instances calling each other (potentially originating from different partitions), the detection algorithm will probably require some algorithm that retrieves this information from different partitions. This entails that the detection can probably not be a local decision.

Resolving the deadlock via undoing changes

Once a deadlock is detected, Restate needs to provide a way to resolve it. Ideally, the user could select an invocation which the runtime can cancel and whose changes will be rolled back so that the state stays consistent. It is unclear at this point in time how Restate could roll back state changes in the general case (e.g. assume that the service changed some state of another service x and that other calls were executed on x afterwards).

Deadlock prevention

If Restate's programming model wouldn't allow the creation of deadlocks, then there would be no need for a deadlock detection algorithm. Potential ideas could be the following:

Directed acyclic call graphs

The problem of deadlocks only arises if there exists a cyclic call graph. Hence, one idea to prevent this from happening is to ask the user to define a hierarchy of services where a service in level x is only allowed to call and await services in level y with y > x. That way, it would not be possible to create cycles.

Such a hierarchy would only be needed for keyed and singleton services since unkeyed services cannot be called again.

How to define a service hierarchy?

One idea for defining the service hierarchy is t ask the user to provide this information when registering a service. She could provide the level as part of the service registration. This would entail that the user has a global understanding of all the services that call her service and the services that her service will call.

Optimistic concurrency control

Since the deadlock situation arises when a service awaits another service while holding a lock, we could prevent it by disallowing awaiting another service while holding the lock. This could be done by introducing a new service type "sequentially executed keyed service" which cannot await. This means that such a service type will only be allowed to send non-blocking background calls. Additionally, calls to a keyed/singleton service instance will be run in parallel (holding no service instance lock anymore) using optimistic concurrency control for resolving conflicts. This means that the user needs to explicitly handle write conflicts by redoing/merging its changes.

restatedev / restate

Preventing or handling of service deadlocks #199