timgit / pg-boss

Queueing jobs in Postgres from Node.js like a boss
MIT License
1.95k stars 153 forks

retry behaviour #290

Closed basaran closed 2 years ago

basaran commented 2 years ago

Hello again :)

I'm planning (at least trying) for a deployment where there will be multiple workers. Let's say 4 workers.

Workers will basically process user-uploaded files. The files will be distributed through some syncing/mirroring solution, so each worker instance will eventually have the same files, but certainly with some delay.

Let's say the worker on server-A picked up a processing job, but it hasn't received the required file yet. Knowing it doesn't have the file, the worker kindly rejects the work, and the job fails.

Now, the job has a retry setting of 4. What are the odds of server-A picking up the same job it rejected in the next 3 retries? Is there anything I can do to avoid that happening?

Hope you have a nice week coming up, thank you.

P.S. Besides fetch

basaran commented 2 years ago

By the way, I already tested this and it appears to be doing round-robin. Just asking if there are any gotchas that could blow up in my face later :)

timgit commented 2 years ago

There isn't any round-robin logic in pg-boss for assigning work since the architecture is pull-based.

basaran commented 2 years ago

I launched three instances and subscribed to the same queue, and I set the handlerFn to fail to observe the behaviour. Job was set to retry 3 times, and each time a different worker picked up the task. Is it just random?

timgit commented 2 years ago

Yes. The workers are polling the queue, so it's not predictable which one will get it.

basaran commented 2 years ago

I will try to modify the subscriber options to include an acceptance condition, would you be interested in a PR if I manage to do so? A callback function to execute before accepting the job.

timgit commented 2 years ago

Once a job is fetched, you can immediately throw an error to kick it back for a retry, but again, you aren't guaranteed that worker won't get that job on the next fetch. You may actually need more than 1 queue if there's a reason a worker can't process it.

basaran commented 2 years ago

If we just keep rejecting, there is a chance the job will never land on a worker that has the original data if the sync fails somehow.

This is what I thought of so far:

a. Modify the handlerFn to reduce the retry count by one directly on the job table, before rejecting inside the handler. This would be the easiest?

b. Modify the subscription method to accept another callback function: modify the function where the args are pulled out, and modify the onFetch inside the watch method to include the callback, or something along those lines. This would be a good learning experience, but I don't know if it would be useful to other pg-boss users.

c. Instead of the additional callback, I could possibly modify pg-boss to do something with the rejected data and not increase the retryCount if the rejected promise returns reject({ soft: true }).

If you would consider a PR, I can go in the direction you point.
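To make option (c) concrete, here's a toy model of the proposed semantics. None of this is pg-boss code; it only illustrates the contract I'm suggesting:

```javascript
// Toy model of the proposed "soft reject" (NOT pg-boss internals):
// a normal failure consumes a retry, a { soft: true } rejection does not.
function applyFailure(job, rejection) {
  const soft = rejection !== null
    && typeof rejection === 'object'
    && rejection.soft === true;
  return {
    ...job,
    state: 'retry',
    retryCount: soft ? job.retryCount : job.retryCount + 1,
  };
}
```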

timgit commented 2 years ago

What if you just increased the delay between retries or used exponential backoff?
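For reference, retry behaviour is configured at publish time. `retryLimit`, `retryDelay`, and `retryBackoff` are real pg-boss publish options (check your version's docs for exact names); the queue name and wrapper function below are illustrative:

```javascript
// Publish-time retry options, sketched. retryLimit / retryDelay /
// retryBackoff exist in pg-boss; the wrapper is an illustrative helper.
async function enqueueUpload(boss, path) {
  return boss.publish('process-upload', { path }, {
    retryLimit: 4,      // up to 4 retry attempts
    retryDelay: 60,     // seconds before the first retry
    retryBackoff: true, // grow the delay exponentially between attempts
  });
}
```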

basaran commented 2 years ago

That would work too, but I'm thinking that if we add a soft reject option that won't increase the retry count (which could run together with exponential backoff), it would have other uses. For instance, there are Node.js libraries for checking network, CPU utilization, and so on. Users could soft-reject if the worker's CPUs are busy, etc. Do you think reducing the retryCount manually is a bad idea?

timgit commented 2 years ago

Yes, please avoid updating the job table directly. In my opinion, failing a job and republishing with the original payload if needed gives you much more control for advanced conditions like this.
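A sketch of that fail-and-republish pattern. `startAfter` is a real pg-boss publish option; the queue name, payload shape, and file check are illustrative assumptions:

```javascript
// Sketch: instead of burning a retry when the file hasn't synced,
// re-queue the original payload with a delay and complete the current
// attempt normally. Names here are illustrative.
async function handleUpload(boss, job, haveFile) {
  if (!haveFile(job.data.path)) {
    // startAfter defers the new job; the original payload is preserved.
    await boss.publish('process-upload', job.data, { startAfter: 60 });
    return { deferred: true }; // resolving completes this job normally
  }
  return { processed: job.data.path };
}
```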

basaran commented 2 years ago

Got it. I noticed the retryCount is increased during fetch. I learned quite a bit. Thank you, and let's close this for now. If I come up with a solid approach, I'll let you know. Have a nice weekend.