microsoft / durabletask-go

The Durable Task Framework is a lightweight, embeddable engine for writing durable, fault-tolerant business logic (orchestrations) as ordinary code.
Apache License 2.0

Issue with distributed traces and multiple instances of the worker #59

Closed. balchua closed this issue 5 months ago

balchua commented 5 months ago

I noticed strange behavior when I run multiple instances of the worker (say, 3), all pointing to the same database. I'm currently using the postgres implementation with some changes.

Here's the screenshot (image: strange-traces).

You can see that the trace contains several orchestration:SimpleOrchestration spans. The same happens with activities.

I also see several log entries like these:

{"time":"2024-01-13T14:55:54.491837674+08:00","level":"ERROR","msg":"orchestration-processor: failed to complete work item: instance 'db1659b0-1528-4042-a500-0cb3822f2cad' no longer exists or was locked by a different worker"}
{"time":"2024-01-13T14:55:54.497473338+08:00","level":"ERROR","msg":"orchestration-processor: failed to abandon work item: lock on work-item was lost"}

I think this happens because several workers process the same work item at the same time, and by the time the others try to complete or abandon it, one of them has already transitioned or completed it.

balchua commented 5 months ago

I found the issue: the different executor / worker instances are picking up the same rows from the database. I fixed it by using SELECT FOR UPDATE SKIP LOCKED, so that rows already picked up (and locked) by one process are simply skipped by the others. I applied this in both GetOrchestrationWorkItem and GetActivityWorkItem.

The code is in this repo.
This repo is a copy of the original PR #33; I just wanted to see how it works with a postgres db.