restatedev / restate

Restate is the platform for building resilient applications that tolerate all infrastructure faults w/o the need for a PhD.
https://docs.restate.dev
Other
1.55k stars 36 forks source link

Preventing or handling of service deadlocks #199

Open tillrohrmann opened 1 year ago

tillrohrmann commented 1 year ago

Problem

With Restate's current model where we execute calls to a service instance sequentially and allow a service to await other services while keeping the lock, users can build applications that deadlock. Examples are the following:

The underlying problem is that we allow awaiting on a future result while holding the lock of a service instance.

Solution ideas

The current solution ideas fall into two categories: deadlock detection and prevention.

Deadlock detection

The idea would be to detect cyclic dependencies between service instances and to notify the user about it. Since a cyclic dependency can be formed by two service instances calling each other (potentially originating from different partitions), the detection algorithm will probably require some algorithm that retrieves this information from different partitions. This entails that the detection can probably not be a local decision.

Resolving the deadlock via undoing changes

Once a deadlock is detected, Restate needs to provide a way to resolve it. Ideally, the user could select an invocation which the runtime can cancel and whose changes will be rolled back so that the state stays consistent. It is unclear at this point in time how Restate could roll back state changes in the general case (e.g. assume that the service changed some state of another service x and that other calls were executed on x afterwards).

Deadlock prevention

If Restate's programming model wouldn't allow the creation of deadlocks, then there would be no need for a deadlock detection algorithm. Potential ideas could be the following:

Directed acyclic call graphs

The problem of deadlocks only arises if there exists a cyclic call graph. Hence, one idea to prevent this from happening is to ask the user to define a hierarchy of services where a service in level x is only allowed to call and await services in level y with y > x. That way, it would not be possible to create cycles.

Such a hierarchy would only be needed for keyed and singleton services since unkeyed services cannot be called again.

How to define a service hierarchy?

One idea for defining the service hierarchy is t ask the user to provide this information when registering a service. She could provide the level as part of the service registration. This would entail that the user has a global understanding of all the services that call her service and the services that her service will call.

Optimistic concurrency control

Since the deadlock situation arises when a service awaits another service while holding a lock, we could prevent it by disallowing awaiting another service while holding the lock. This could be done by introducing a new service type "sequentially executed keyed service" which cannot await. This means that such a service type will only be allowed to send non-blocking background calls. Additionally, calls to a keyed/singleton service instance will be run in parallel (holding no service instance lock anymore) using optimistic concurrency control for resolving conflicts. This means that the user needs to explicitly handle write conflicts by redoing/merging its changes.

jackkleeman commented 1 year ago

NB; i have a verification test that runs into an occasional deadlock, and it feels to me that it is of the more subtle kind that might be a problem for some users, because the cycle is not within one call chain (eg 1 -> 2 -> 3 -> 1) key 1 syncCalls key 2 key 2 syncCalls key 1 something else calls (sync or otherwise) key 1 and 2 concurrently So basically one acquires lock on 1, the other on 2, and then both are stuck forever. I think I will have to add some cycle detection to the verif test, as the approach currently used only looks for loops within one call chain, not the whole graph.

one of the interesting things here is that by reordering the inbox you can avoid the deadlock