microsoft / durabletask-go

The Durable Task Framework is a lightweight, embeddable engine for writing durable, fault-tolerant business logic (orchestrations) as ordinary code.
Apache License 2.0

frequent `unknown instance ID` error when running multiple backend instances #69

Open famarting opened 2 months ago

famarting commented 2 months ago

We have found a scenario where a transient error happens frequently:

failed to complete orchestration task: rpc error: code = Unknown desc = unknown instance ID: 5f7b2345-897d-4471-af96-6c8e590a29bf

The unknown instance ID error could be considered transient, because after the server returns it the client's retries eventually succeed, but IMO it points to a more fundamental problem with the server-side implementation.

In our scenario we run multiple instances of the server, so there are multiple gRPC servers behind a load balancer. As a result, a CompleteOrchestratorTask request can land on a server that has no "pending orchestrator" to serve it.

Here is the series of steps I went through to come to that conclusion:

cgillum commented 2 months ago

Thanks - I believe I understand the issue to be a limitation in how the server APIs are implemented.

For context, the server APIs were primarily designed for use in the context of a sidecar process, in which case there is a 1:1 relationship between workers (clients) and servers. I think in order for your client/server architecture to work, you'll need to implement some kind of sticky affinity between workers and the servers that they communicate with.

famarting commented 3 weeks ago

Thank you for your response

I was thinking about what would be needed to support a scenario where there can be many clients and many servers, and maybe it's not that big of a change...

The solution could consist of replacing the maps pendingOrchestrators and pendingActivities with new functions provided by the Backend.

    SetPendingOrchestrator(id api.InstanceID) error
    CompletePendingOrchestrator(id api.InstanceID, res *protos.OrchestratorResponse) error
    WaitForPendingOrchestrator(ctx context.Context, id api.InstanceID) (*protos.OrchestratorResponse, error)

    SetPendingActivity(id api.InstanceID) error
    CompletePendingActivity(id api.InstanceID, res *protos.ActivityResponse) error
    WaitForPendingActivity(ctx context.Context, id api.InstanceID) (*protos.ActivityResponse, error)

It then becomes the backend's responsibility to decide how to keep that state. For a quick implementation one could do the same as before and continue using maps to track the pending orchestrators and activities. However, if the backend keeps this state in persistent (or otherwise shared) storage, calls to CompleteOrchestratorTask and CompleteActivityTask could be received by any server instance, potentially enabling horizontal scaling.

The WaitForPendingOrchestrator and WaitForPendingActivity functions would then be called from ExecuteOrchestrator and ExecuteActivity, respectively.
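
In case it helps, here is a minimal in-memory sketch of the orchestrator half of this idea, assuming the signatures above. The package name, the inMemoryPending type and the import paths are only illustrative; a real multi-server backend would replace the map with shared storage.

    // Illustrative only: an in-memory implementation of the proposed
    // SetPendingOrchestrator / CompletePendingOrchestrator /
    // WaitForPendingOrchestrator methods. Import paths are assumptions.
    package backendsketch

    import (
        "context"
        "fmt"
        "sync"

        "github.com/microsoft/durabletask-go/api"
        "github.com/microsoft/durabletask-go/internal/protos"
    )

    type inMemoryPending struct {
        mu      sync.Mutex
        pending map[api.InstanceID]chan *protos.OrchestratorResponse
    }

    func (p *inMemoryPending) SetPendingOrchestrator(id api.InstanceID) error {
        p.mu.Lock()
        defer p.mu.Unlock()
        if p.pending == nil {
            p.pending = make(map[api.InstanceID]chan *protos.OrchestratorResponse)
        }
        // Register a buffered channel so the completer never blocks.
        p.pending[id] = make(chan *protos.OrchestratorResponse, 1)
        return nil
    }

    func (p *inMemoryPending) CompletePendingOrchestrator(id api.InstanceID, res *protos.OrchestratorResponse) error {
        p.mu.Lock()
        ch, ok := p.pending[id]
        p.mu.Unlock()
        if !ok {
            // The failure mode from this issue: the completion arrived on an
            // instance that never registered the work item.
            return fmt.Errorf("unknown instance ID: %s", id)
        }
        ch <- res
        return nil
    }

    func (p *inMemoryPending) WaitForPendingOrchestrator(ctx context.Context, id api.InstanceID) (*protos.OrchestratorResponse, error) {
        p.mu.Lock()
        ch, ok := p.pending[id]
        p.mu.Unlock()
        if !ok {
            return nil, fmt.Errorf("unknown instance ID: %s", id)
        }
        select {
        case res := <-ch:
            return res, nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }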

cgillum commented 3 weeks ago

I think the approach of making work item tracking a responsibility of the backend makes sense. The tradeoff is that it makes backend development a bit more complex since backend implementations must now track work-item state themselves.

I didn't quite understand what SetPendingOrchestrator/Activity and WaitForPendingOrchestrator/Activity would be used for. Could you go into a bit more detail? Feel free to open a PR if that would help make it clearer. I was expecting we'd only need signature changes to the CompleteXXXWorkItem and AbandonXXXWorkItem methods.

famarting commented 2 weeks ago

Here is a PR showing the proposal: https://github.com/microsoft/durabletask-go/pull/73. I hope it helps.

It may be a bit confusing, hence your comment, that the Backend interface now has both the existing XXXWorkItem functions and the new PendingXXX functions.

But these functions serve different purposes:

The XXXWorkItem functions are called from backend/orchestration.go and backend/activity.go, which are implementations of TaskProcessor. A TaskProcessor appears to run within a single process instance, so the XXXWorkItem calls on the backend interface can rely on stateful logic, because they are always made from the same process.

However, the PendingXXX functions are called from backend/executor.go, which is the implementation of the gRPC service. These functions should not rely on stateful logic and should externalize their state, because, being a gRPC service, there is no guarantee that all the calls will be made to the same process.

Does that make sense? Maybe to avoid confusion the Backend interface could be split into multiple interfaces: an in-process or internal interface, and another interface only meant to be used from the gRPC service.
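
For example, the split could look roughly like this. It is only a sketch: the method names mirror this thread, the existing work-item methods are abbreviated, and the real signatures may differ.

    // Illustrative only: one possible split of the Backend interface.
    package backendsketch

    import (
        "context"

        "github.com/microsoft/durabletask-go/api"             // assumed import path
        "github.com/microsoft/durabletask-go/internal/protos" // assumed import path
    )

    // ProcessorBackend would keep the existing XXXWorkItem / AbandonXXXWorkItem
    // methods. They are only called from the in-process TaskProcessor
    // implementations, so stateful in-memory logic remains acceptable there.
    type ProcessorBackend interface {
        // ...existing work-item methods (signatures omitted)...
    }

    // ExecutorBackend is called from the gRPC executor. Any server instance may
    // receive these calls, so implementations must externalize this state.
    type ExecutorBackend interface {
        SetPendingOrchestrator(id api.InstanceID) error
        CompletePendingOrchestrator(id api.InstanceID, res *protos.OrchestratorResponse) error
        WaitForPendingOrchestrator(ctx context.Context, id api.InstanceID) (*protos.OrchestratorResponse, error)

        SetPendingActivity(id api.InstanceID) error
        CompletePendingActivity(id api.InstanceID, res *protos.ActivityResponse) error
        WaitForPendingActivity(ctx context.Context, id api.InstanceID) (*protos.ActivityResponse, error)
    }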

famarting commented 2 weeks ago

I've just found this PR, which also fixes other issues I've had: https://github.com/microsoft/durabletask-go/pull/61

I'll re-think the proposal to try to accommodate the changes in that PR.

famarting commented 1 week ago

I wanted to discard my original proposal because it introduced functions such as WaitForPendingOrchestrator which, depending on the backend implementation, could end up polling a database... That proposal would require very few changes to the existing implementation, but polling databases is something we would prefer to avoid.

So then I started thinking about how we could decouple the ExecuteOrchestrator and CompleteOrchestratorTask functions so that it becomes technically possible to execute them on different server instances... however, changing how the gRPC executor and the task worker currently work together so that they can be distributed across multiple servers is not as straightforward as I expected 😄.

The ExecuteOrchestrator and ExecuteActivity functions are blocking functions that "dispatch" the work item to the work items channel and then wait for a response to arrive on a callback channel. That statefulness, expecting to receive the orchestration or activity response via a callback channel, is what I'm looking to get rid of: receiving a signal on the callback channel implies that the response originated in the same instance, and when running multiple servers there is no guarantee of that.
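
Very roughly, the pattern being described is something like the following. This is a simplified illustration and not the actual backend/executor.go code; all the type, field and function names here are made up.

    // Simplified illustration of the blocking dispatch/wait pattern described
    // above; none of these names come from the real codebase.
    package executorsketch

    import "context"

    type workItem struct{ /* orchestrator or activity request */ }
    type response struct{ /* orchestrator or activity result */ }

    type pendingWork struct {
        item     *workItem
        callback chan *response // the completion must be delivered in this same process
    }

    func execute(ctx context.Context, workQueue chan<- *pendingWork, item *workItem) (*response, error) {
        pw := &pendingWork{item: item, callback: make(chan *response, 1)}

        // "Dispatch" phase: hand the work item to the worker loop.
        workQueue <- pw

        // "Wait for result" phase: this only works if the corresponding
        // CompleteXXXTask call reaches this same server instance, which is
        // exactly the assumption that breaks behind a load balancer.
        select {
        case res := <-pw.callback:
            return res, nil
        case <-ctx.Done():
            return nil, ctx.Err()
        }
    }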

I already started a POC implementation where ExecuteOrchestrator and ExecuteActivity are broken into two phases, "dispatch" and "process result"... but the refactor is turning out to be too large, so before I continue I think it would be better to discuss this topic with others.

famarting commented 1 week ago

Well, I have iterated again with our team on the proposal made here: https://github.com/microsoft/durabletask-go/issues/69#issuecomment-2166387387

A PR showing a rough version of it is here: https://github.com/microsoft/durabletask-go/pull/73

We have changed our mind and now think the idea of adding functions such as WaitForPendingXXXX is a necessary evil: it helps keep the refactor contained and it means the contract of ExecuteOrchestrator and ExecuteActivity stays intact.

From the Dapr backend implementation point of view, we should be able to build an implementation that leverages reminders to notify the instance blocked on WaitForPendingXXXX. This would solve the problem of running multiple server instances behind a load balancer, and it would be an efficient implementation.

wdyt @cgillum

cgillum commented 1 week ago

Hi @famarting - it might help if we could create a simple visual sketch of the proposed interaction pattern between the engine and the backend (even better if we can show the before and after) just to make sure I'm correctly understanding the high-level design. There are a few details that I'm having trouble visualizing based on the text description and PR.