timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License

Avoid expiring long-running jobs #238

Closed: gavacho closed this issue 3 years ago

gavacho commented 3 years ago

If I have configured pg-boss to consider jobs expired after 15 minutes and one of my jobs takes 60 minutes, then that job always gets marked as expired, even if the handler eventually resolves or rejects. This means that the alert we set up to inform us about expired jobs raises a lot of false alarms.

It would be useful to us if we could signal that work on a job is on-going and the job is not actually expiring.

timgit commented 3 years ago

Failure via expiration in pg-boss follows a similar architectural pattern to the visibility timeout in AWS SQS. If you fetch a message from an SQS queue and then fail to delete it within this timeout, it becomes available again for another worker to fetch. SQS maintenance doesn't have access to the memory space where you're processing the payload. I'm mentioning this first because you may find their docs on retries and idempotency useful for your use case, and they have a lot more docs than I do. :)

Similarly, maintenance operations in pg-boss are isolated SQL queries that look for specific conditions, such as expiration. They don't have access to Node.js memory or state, so they can't make any assumptions about what may or may not be running in a worker on this instance. This isolation is especially useful when you run multiple instances.

Having said all that, I think you should either extend the expiration config on your jobs or include something like a Promise.race() in your handler, with a package like delay, to throw an error after a configured timeout. This would at least guarantee that your handlers won't run unbounded, if that is a concern.
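A minimal TypeScript sketch of the Promise.race() approach. Instead of the delay package it uses a small hand-written timer helper, and withTimeout is a hypothetical name, not part of pg-boss. Note that Promise.race() only makes the handler reject on time; it does not stop the underlying work, which keeps running in the background.

```typescript
// Creates a promise that rejects after `ms` milliseconds, plus a way to cancel it.
// Stands in for the `delay` package mentioned above (an assumption, not its API).
function timeout(ms: number, message: string): { promise: Promise<never>; cancel: () => void } {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const promise = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(message)), ms);
  });
  return { promise, cancel: () => clearTimeout(timer) };
}

// Hypothetical wrapper: settles with the work's result, or rejects once `ms` elapses,
// whichever comes first. The work itself is NOT cancelled if the timeout wins.
async function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  const t = timeout(ms, `job handler exceeded ${ms} ms`);
  try {
    return await Promise.race([work, t.promise]);
  } finally {
    t.cancel(); // clear the timer so it doesn't keep the process alive
  }
}
```

In a worker, the handler would return withTimeout(doWork(job), limitMs), where doWork is your own (hypothetical) function and limitMs is kept below the queue's expiration, so an overrun surfaces as an ordinary handler failure rather than an expiration.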

gavacho commented 3 years ago

I wouldn't want the system to make assumptions about a job still being in progress. But if my job handler were given a function I could call to say "hey, I'm still working here", and calling it updated a job.heartbeaton timestamptz column, then pg-boss could use that value when determining which jobs should be expired.
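To make the proposal concrete, here is a TypeScript sketch of the worker side. Everything in it is hypothetical: pg-boss does not pass a heartbeat function to handlers and has no heartbeaton column, so Heartbeat and runWithHeartbeat are illustrative names, and the heartbeat callback stands in for whatever would update that column.

```typescript
// Hypothetical: models the proposed "I'm still working" signal. In the proposal,
// heartbeat() would run something like
//   UPDATE job SET heartbeaton = now() WHERE id = $1
// against a column that does not exist in pg-boss.
type Heartbeat = () => Promise<void>;

async function runWithHeartbeat<T>(
  work: () => Promise<T>,
  heartbeat: Heartbeat,
  intervalMs: number,
): Promise<T> {
  // Signal "still working" every intervalMs while the work is in flight.
  const timer = setInterval(() => {
    // A failed heartbeat should not crash the handler; the job would just
    // look stale to maintenance and expire on its normal schedule.
    heartbeat().catch(() => undefined);
  }, intervalMs);
  try {
    return await work();
  } finally {
    clearInterval(timer); // stop signalling once the handler settles
  }
}
```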

timgit commented 3 years ago

Even if pg-boss had a feature that deferred the configured expiration, there should still be a maximum amount of time a job is allowed to stay in the active state, which effectively becomes a second expiration setting.
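The trade-off can be modeled as a pure function. This is a hypothetical sketch, not pg-boss's expiration logic: shouldExpire, heartbeatOn, heartbeatTimeoutMs and maxActiveMs are invented names. A job expires if it has gone quiet for too long or if it has been active longer than an absolute ceiling.

```typescript
// Hypothetical shape of an active job under the heartbeat proposal.
interface ActiveJob {
  startedOn: Date;          // when a worker fetched the job
  heartbeatOn: Date | null; // last "still working" signal, if any
}

function shouldExpire(
  job: ActiveJob,
  now: Date,
  heartbeatTimeoutMs: number, // max allowed silence between heartbeats
  maxActiveMs: number,        // the "second expiration": an absolute ceiling
): boolean {
  // With no heartbeat yet, measure silence from when the job started.
  const lastSignal = job.heartbeatOn ?? job.startedOn;
  const silentTooLong = now.getTime() - lastSignal.getTime() > heartbeatTimeoutMs;
  const activeTooLong = now.getTime() - job.startedOn.getTime() > maxActiveMs;
  return silentTooLong || activeTooLong;
}
```

Without the maxActiveMs check, a job whose handler is stuck awaiting something that never resolves, while its heartbeat timer keeps firing, would never expire. That ceiling is exactly the extra expiration setting described above.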

It seems that the best approach in your case is to raise the expiration to a value large enough to cover all normal execution times. If you need progress reporting during job execution, that could be implemented without involving the queue.

Hopefully I'm understanding your goal here. If you don't want to worry about pg-boss ever marking a job as expired, for example, just set the expiration to a really large interval, such as a week.
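A sketch of passing a one-week expiration when publishing. The Publisher interface is a stand-in so the snippet compiles without pg-boss. It assumes a version whose publish() accepts an expireInHours option, which matches pg-boss releases from around the time of this thread, but option names (and publish vs. send) have changed across major versions, so check the docs for the version you run. publishLongRunning is a hypothetical helper.

```typescript
// Assumption: mirrors pg-boss's publish(name, data, options) from this era;
// verify the option name against the docs for your installed version.
interface PublishOptions {
  expireInHours?: number;
}

interface Publisher {
  publish(name: string, data: object, options?: PublishOptions): Promise<string | null>;
}

const ONE_WEEK_IN_HOURS = 7 * 24;

// Hypothetical helper: enqueue a long-running job that maintenance should not
// mark as expired until it has been active for a week.
function publishLongRunning(boss: Publisher, name: string, data: object): Promise<string | null> {
  return boss.publish(name, data, { expireInHours: ONE_WEEK_IN_HOURS });
}
```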

gavacho commented 3 years ago

I missed that I can configure this via publish options. Thanks!