timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License

pgBoss.stop doesn't remove active jobs #303

Open. dolegi opened this issue 2 years ago

dolegi commented 2 years ago

Hey, first off, thanks so much for pgboss; it's an extremely useful library!

When calling pgBoss.stop() and waiting for the stopped event, jobs that take longer than the timeout get stuck in the active state.

What currently happens

We have some singleton jobs that run anywhere from ~10 minutes to just over 1 hour, so we have set them to expire only after 120 minutes. When we re-deploy our job workers, the active jobs stay in pgboss until they expire, so a job doesn't get re-triggered until its active entry (which no worker is actually working on anymore) expires.

Request

Ideally, when re-deploying, we could catch the SIGTERM and call pgboss.stop({timeout: x}), which would stop the worker and remove any active jobs.
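Roughly, the flow we have in mind looks like this sketch; `boss` is our already-started PgBoss instance, the 30s timeout is a placeholder, and the missing piece is that the jobs this instance had active are left untouched:

```js
// Sketch only: pg-boss stops fetching and waits for in-flight handlers,
// but it does not remove/fail jobs still marked active -- that is what
// this issue is asking for.
process.once('SIGTERM', async () => {
  // Register the listener first so we don't miss the event, then stop.
  const stopped = new Promise(resolve => boss.once('stopped', resolve));

  // Stop fetching new jobs; give in-flight handlers up to 30s (placeholder).
  await boss.stop({ timeout: 30000 });

  // Wait for pg-boss to report it has fully stopped.
  await stopped;

  process.exit(0);
});
```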

TL;DR Request

Have pgBoss.stop() delete/update active jobs when the worker stops.

Or should we be manually deleting active jobs by tracking job ids and updating the pgboss.job table ourselves? Is there a recommended way to approach this?

Related issues

https://github.com/timgit/pg-boss/issues/268

Thanks!

timgit commented 2 years ago

Hey, thanks! I agree with your suggestion, which is pretty similar to the expiration promise that is started along with jobs in the worker. I will look into an ideal way of opting into this.

Also, have you considered listening to SIGTERM in your worker callback function to do your own failure?
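Something along these lines (a rough, untested sketch to run inside your async startup code; the queue name and doTheLongRunningThing are placeholders):

```js
// Each handler records its own job id; on SIGTERM this instance fails only
// the jobs it was still working on, so they can be retried by another worker.
const activeJobIds = new Set();

process.once('SIGTERM', async () => {
  await boss.stop({ timeout: 30000 });
  for (const id of activeJobIds) {
    await boss.fail(id, { reason: 'worker received SIGTERM' });
  }
  process.exit(0);
});

await boss.work('long-running-queue', async job => {
  activeJobIds.add(job.id);
  try {
    await doTheLongRunningThing(job.data); // placeholder for the real work
  } finally {
    activeJobIds.delete(job.id);
  }
});
```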

dolegi commented 2 years ago

Hi Tim, thanks for looking into it. We are considering updating the job statuses directly, but it feels wrong and goes against the intended way of working with pgboss.

UPDATE pgboss.job SET state = '<abandoned>' WHERE state = 'active' AND id IN (<ids from this instance's worker>);

We have to be careful to only update the job ids from the current instance, since other instance workers could still be actively processing jobs.
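If we do go the manual route, we'd parameterize it and scope it strictly to our own ids, something like this sketch with node-postgres (setting state to 'failed' is our own assumption about a reasonable target state, not something the docs recommend):

```js
// Sketch only: bypasses the pg-boss API, which is exactly why it feels wrong.
// Assumes the default pgboss.job table and that 'failed' is an acceptable
// target state; ourJobIds contains only ids handed to this worker instance.
const { Pool } = require('pg');
const pool = new Pool({ connectionString: process.env.DATABASE_URL });

async function failOurActiveJobs(ourJobIds) {
  if (ourJobIds.length === 0) return;
  await pool.query(
    `UPDATE pgboss.job
        SET state = 'failed'
      WHERE state = 'active'
        AND id = ANY($1::uuid[])`, // only this instance's jobs
    [ourJobIds]
  );
}
```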

StarpTech commented 1 year ago

Hi @timgit any updates on this?

timgit commented 1 year ago

No work is being planned for this request right now. There is a reason SQS doesn't allow you to hold on to a message for hours, first of all. But long-running promises aside, I think the best approach would be to fail the jobs after the timeout. They would be eligible for retry at that point by another worker.
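For example, a sketch of a handler-side timeout that rejects so the job is failed and becomes retryable (the 100-minute figure and queue name are placeholders, and the queue's retryLimit/retryDelay are assumed to be configured):

```js
// Reject the handler after a timeout so pg-boss marks the job failed,
// making it eligible for retry per the queue's retry settings.
function withTimeout(promise, ms) {
  let timer;
  const timeout = new Promise((_, reject) => {
    timer = setTimeout(() => reject(new Error('handler timed out')), ms);
  });
  return Promise.race([promise, timeout]).finally(() => clearTimeout(timer));
}

await boss.work('long-running-queue', async job => {
  await withTimeout(doTheLongRunningThing(job.data), 100 * 60 * 1000);
});
```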

timgit commented 1 year ago

I'll consider adding this in v10.