Open meyer9 opened 7 months ago
@timgit I was thinking about this exact addition to pg-boss. We have the scenario the OP describes: a worker process exits in an uncontrolled way and leaves a job in the active state. There doesn't seem to be a way for other workers to detect this, so we end up with jobs stuck in limbo.
As the OP mentioned, a heartbeat column in the jobs table would allow a maintenance task or other workers to detect the stuck job and perhaps move it into the retry state so it can be processed again.
What are your thoughts on this? I am happy to work on this update/feature as we have a fairly urgent need. It should benefit other pg-boss users too. 😄
+1 to this, would love to see pg-boss better recognize when a job has stopped for reasons such as the worker process crashing.
If we had a field `last_heartbeat`, we could more quickly detect when a job fails due to a process crash or similar, where it may never be updated to failed. One example would be to send a heartbeat every 15 seconds (setting the column to `NOW()`) and to mark a job as failed/expired if the heartbeat is more than 60 seconds old.
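
To make the idea concrete, here is a rough sketch of the staleness check, using the 15-second and 60-second figures above. Everything in it is an assumption for illustration: `last_heartbeat` is the proposed column, not something pg-boss has today, and `ActiveJob`, `isStale`, `findStaleJobs`, and the table and column names in the two SQL strings are hypothetical and may not match the actual pg-boss schema.

```typescript
// Hypothetical sketch only: none of these names are part of pg-boss.
const HEARTBEAT_INTERVAL_MS = 15_000; // how often a worker would refresh last_heartbeat
const HEARTBEAT_TIMEOUT_MS = 60_000; // a job is treated as stuck after this much silence

interface ActiveJob {
  id: string;
  lastHeartbeat: Date; // value of the proposed last_heartbeat column
}

// True when the job's last heartbeat is older than the timeout.
function isStale(
  job: ActiveJob,
  now: Date,
  timeoutMs: number = HEARTBEAT_TIMEOUT_MS
): boolean {
  return now.getTime() - job.lastHeartbeat.getTime() > timeoutMs;
}

// Jobs a maintenance task would move to retry (or mark failed/expired).
function findStaleJobs(jobs: ActiveJob[], now: Date): ActiveJob[] {
  return jobs.filter((job) => isStale(job, now));
}

// Statement a worker might run every HEARTBEAT_INTERVAL_MS
// (assumed table and column names).
const heartbeatSql =
  "UPDATE job SET last_heartbeat = NOW() WHERE id = $1 AND state = 'active'";

// Statement a maintenance task might run to release stuck jobs
// (assumed table and column names; the interval mirrors HEARTBEAT_TIMEOUT_MS).
const sweepSql =
  "UPDATE job SET state = 'retry' " +
  "WHERE state = 'active' AND last_heartbeat < NOW() - INTERVAL '60 seconds'";
```

Keeping the timeout at several multiples of the interval means a single missed or delayed heartbeat would not cause a healthy job to be picked up a second time.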