maxcountryman / underway

⏳ Durable step functions via Postgres.
Apache License 2.0

refactor dequeue transaction handling #55

Closed: maxcountryman closed this 2 weeks ago

maxcountryman commented 2 weeks ago

This refactor encapsulates pending task dequeue operations within their own transactions, updating the task row state to prevent duplicate processing by other callers. By doing so, we ensure that state changes are immediately visible, accurately reflecting task ownership throughout its processing lifecycle.

Additionally, tasks now include a configurable heartbeat interval and a record of the last heartbeat. Workers periodically update the task row to indicate ongoing liveness. Should a task’s heartbeat become stale, the dequeue method can select it for reassignment.
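To make the staleness rule concrete, here is a minimal Rust sketch of the check a dequeue could apply. It is an in-memory illustration only: `TaskRow`, `TaskState`, and `is_dequeueable` are hypothetical names, not underway's actual types or schema, and the real selection presumably happens in SQL.

```rust
use std::time::{Duration, SystemTime};

/// Hypothetical task states; names are illustrative, not underway's schema.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TaskState {
    Pending,
    InProgress,
    Succeeded,
}

/// Hypothetical in-memory view of a task row.
pub struct TaskRow {
    pub state: TaskState,
    /// Configured heartbeat interval for this task.
    pub heartbeat: Duration,
    /// When a worker last reported liveness for this task.
    pub last_heartbeat_at: SystemTime,
}

/// Returns true if a dequeue may select this task: it is pending, or it is
/// in progress but its last heartbeat is older than the configured interval.
pub fn is_dequeueable(task: &TaskRow, now: SystemTime) -> bool {
    match task.state {
        TaskState::Pending => true,
        TaskState::InProgress => now
            .duration_since(task.last_heartbeat_at)
            .map(|elapsed| elapsed > task.heartbeat)
            // A heartbeat "in the future" (clock skew) is treated as fresh.
            .unwrap_or(false),
        TaskState::Succeeded => false,
    }
}
```

With a 30-second interval, a task whose last heartbeat was 10 seconds ago stays with its worker, while one last seen 31 seconds ago becomes eligible for reassignment.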

It's important to note that a missed heartbeat alone does not definitively indicate task abandonment, as a worker might resume processing after a temporary delay. To guard against this type of partial failure, workers also acquire a transaction-level advisory lock on the task ID. As long as a worker's transaction remains active, this lock prevents other workers from processing the same task, ensuring exclusive ownership and consistent processing even across intermittent failures.
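The locking behavior can be modeled with a small in-memory Rust sketch. It is only an analogy: `LockTable` and `TxLock` are hypothetical stand-ins, and in Postgres the equivalent would be a transaction-level advisory lock such as `pg_try_advisory_xact_lock`, which is released automatically when the transaction ends (which exact function the PR uses is an assumption here). Dropping the guard plays the role of the transaction ending.

```rust
use std::collections::HashSet;
use std::sync::Mutex;

/// In-memory stand-in for Postgres advisory locks (illustrative only).
#[derive(Default)]
pub struct LockTable {
    held: Mutex<HashSet<i64>>,
}

/// Stand-in for an open transaction that holds a lock on one task ID.
/// Dropping it models the transaction committing or rolling back.
pub struct TxLock<'a> {
    table: &'a LockTable,
    key: i64,
}

impl LockTable {
    /// Non-blocking attempt to lock `key`; returns `None` if another
    /// holder already owns it.
    pub fn try_lock(&self, key: i64) -> Option<TxLock<'_>> {
        let mut held = self.held.lock().unwrap();
        if held.insert(key) {
            Some(TxLock { table: self, key })
        } else {
            None
        }
    }
}

impl Drop for TxLock<'_> {
    fn drop(&mut self) {
        // Release the key when the "transaction" ends.
        if let Ok(mut held) = self.table.held.lock() {
            held.remove(&self.key);
        }
    }
}
```

A worker that missed a heartbeat but is still alive keeps its guard, so a second worker's `try_lock` on the same task ID fails until the first worker's transaction ends.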

A notable benefit of these changes is that task progress states are fully utilized and in-progress tasks are visible globally. Furthermore, transaction overhead is reduced, as a dequeue's transaction is only held for as long as it takes to obtain an available task. That said, a second transaction is still held for the duration of execution, so long-running tasks still benefit from being decomposed into, e.g., multiple job steps.

maxcountryman commented 2 weeks ago

@kirillsalykin this addresses the fact that "in-progress" hasn't been used and increases visibility overall. I think this is also somewhat closer to how e.g. pg-boss approaches things.

kirillsalykin commented 2 weeks ago

I might be wrong (please correct me if so), but this change reduces consistency. Imagine this scenario: work is started, the task is marked as in_progress, then the worker gets killed without any chance to update the database (for instance, the connection drops) and the task stays in the in_progress state. It will never be picked up again...

UPDATED: ah, I see what you did here. If a task misses its heartbeat, it is considered failed.

Sorry for the noise, clear now!

PS: I should read the description before making comments.

kirillsalykin commented 2 weeks ago

This makes the code (and the approach) slightly more complicated, but I think not keeping the tx open for the duration of the task is a good thing...

maxcountryman commented 2 weeks ago

Agree on all points. I wish it weren't more complex, of course, but I think it's worth it for the observability gained. I also trust that projects like pg-boss have had longer to mature, so there are probably good reasons for some of their more significant design decisions, which we can also benefit from.