timgit / pg-boss

Queueing jobs in Node.js using PostgreSQL like a boss
MIT License

Job heartbeat support #436

Open meyer9 opened 7 months ago

meyer9 commented 7 months ago

If we had a `last_heartbeat` field, we could more quickly detect when a job fails because of a process crash or similar, where the job may never be properly updated to failed.

One example would be to send a heartbeat every 15 seconds (setting the column to `NOW()`) and mark a job as failed/expired if its heartbeat is more than 60 seconds old.
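A minimal sketch of that timing rule, assuming the 15-second interval and 60-second timeout from the example above. The names `isHeartbeatStale`, `startHeartbeat`, and `touch` are hypothetical and not part of pg-boss; `touch` stands in for whatever would set the proposed `last_heartbeat` column to `NOW()`.

```typescript
// Hypothetical sketch: none of these names are pg-boss API.
const HEARTBEAT_INTERVAL_MS = 15_000; // worker sends a heartbeat every 15 seconds
const HEARTBEAT_TIMEOUT_MS = 60_000; // job counts as dead after more than 60 seconds of silence

// Returns true when the last heartbeat is strictly older than the timeout.
function isHeartbeatStale(
  lastHeartbeat: Date,
  now: Date,
  timeoutMs: number = HEARTBEAT_TIMEOUT_MS
): boolean {
  return now.getTime() - lastHeartbeat.getTime() > timeoutMs;
}

// Calls `touch` on a fixed interval and returns a function that stops the timer.
// `touch` is an assumed callback that would update last_heartbeat for the job.
function startHeartbeat(
  touch: () => Promise<void>,
  intervalMs: number = HEARTBEAT_INTERVAL_MS
): () => void {
  const timer = setInterval(() => {
    // A failed heartbeat is logged, not thrown, so the job itself keeps running.
    touch().catch((err) => console.error("heartbeat failed", err));
  }, intervalMs);
  return () => clearInterval(timer);
}
```

A worker would call `startHeartbeat` when it picks up a job and invoke the returned stop function once the job completes or fails. If the process crashes, the heartbeats simply stop, which is what makes the job detectable.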

Crispy1975 commented 2 months ago

@timgit I was thinking about this exact addition to pg-boss. We have the scenario the OP describes: a worker process exits in an uncontrolled way, leaving a job in the active state. There doesn't seem to be a way for other workers to know about this, so we end up with jobs stuck in limbo.

As the OP mentioned, a heartbeat column in the jobs table would allow a maintenance task or other workers to detect the stuck job and perhaps move it into the retry state so it can be processed again.
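A rough sketch of what such a maintenance pass might look like, written against in-memory rows so it runs without a database. The `JobRow` shape, the `lastHeartbeat` field, and `sweepStaleJobs` are assumptions made for illustration; the state names are illustrative too, and this is not how pg-boss is implemented.

```typescript
// Hypothetical sketch: types and names are illustrative, not pg-boss internals.
type JobState = "created" | "retry" | "active" | "completed" | "failed";

interface JobRow {
  id: string;
  state: JobState;
  lastHeartbeat: Date | null; // the proposed heartbeat column; null if never set
}

// Returns a new list in which active jobs with a stale heartbeat are moved to
// the retry state. Jobs in any other state, or without a heartbeat, are unchanged.
function sweepStaleJobs(jobs: JobRow[], now: Date, timeoutMs: number): JobRow[] {
  return jobs.map((job): JobRow => {
    const stale =
      job.state === "active" &&
      job.lastHeartbeat !== null &&
      now.getTime() - job.lastHeartbeat.getTime() > timeoutMs;
    return stale ? { ...job, state: "retry" } : job;
  });
}
```

In a real implementation the same check would more likely be a single `UPDATE` with a `WHERE` clause on the state and heartbeat columns, so that concurrent workers cannot both claim the same stuck job.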

What are your thoughts on this? I am happy to work on this feature, as we have a fairly urgent need. It should benefit other pg-boss users too. 😄

schester44 commented 1 month ago

+1 to this, would love to see pg-boss better recognize when a job has stopped for reasons such as the worker process crashing.