taskforcesh / bullmq

BullMQ - Message Queue and Batch processing for NodeJS and Python based on Redis
https://bullmq.io
MIT License
6.25k stars, 408 forks

[Feature Request/Question] Allow failed jobs to be retried on different workers when using linear backoff with zero delay #2789

Open dzmm opened 2 months ago

dzmm commented 2 months ago

Currently, when using linear backoff with a delay of 0, failed jobs are retried on the same worker. However, in some scenarios, a worker might be on a malfunctioning machine, and we need the ability to retry the job on a different worker.

Current Behavior

With linear backoff and zero delay, failed jobs are always retried on the same worker that initially failed to process them.
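
To make the zero-delay case concrete, here is a minimal sketch of what a linear backoff calculation could look like. The issue does not show the actual strategy in use, so the function name `linearBackoffDelay` and the formula (attempt number times base delay) are assumptions for illustration, not BullMQ's implementation.

```typescript
// Assumed definition of "linear" backoff: the delay grows in proportion
// to the number of attempts already made. Hypothetical helper, not a BullMQ API.
function linearBackoffDelay(attemptsMade: number, baseDelayMs: number): number {
  return attemptsMade * baseDelayMs;
}

// With a base delay of 0, every retry is eligible immediately,
// no matter how many attempts have been made.
const delays = [1, 2, 3].map((attempt) => linearBackoffDelay(attempt, 0));
console.log(delays);
```

Under this assumption, a base delay of 0 means the job goes straight back to being available after each failure, with no waiting period in between.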

Desired Behavior

Even with linear backoff and zero delay, failed jobs should have the option to be retried on different workers, allowing for better fault tolerance and recovery from worker-specific issues.

Is there any way I can do this on the current version of BullMQ?

manast commented 2 months ago

I do not think that, by design, it works the way you are describing.

Most likely, since the worker has just finished processing this job and the delay is zero, it is the first to pick it up again. If the worker were malfunctioning, it would not pick up the same job again. There is still a chance, though, that some other idle worker picks it up.

In any case, it would be impossible to guarantee that the worker that failed the job would not pick it up again, so there really is not a lot we can do here.

manast commented 2 weeks ago

For this to work, a worker must keep some kind of list of the jobIds of recently failed jobs, so that it ignores them and thus gives other workers a chance to pick them up. It is not completely trivial to implement, though: this list of jobs must be passed to the moveToActive Lua script on every call, or be stored in some specific Redis key...
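
The worker-side bookkeeping described above might be sketched roughly as follows. This only models the in-memory list of recently failed job IDs. The class name `RecentFailures`, the time-to-live parameter, and the injectable clock are all hypothetical and not part of BullMQ, and the harder part (making the moveToActive Lua script honor the list) is not shown.

```typescript
// Hypothetical helper, not part of BullMQ: remembers job IDs that recently
// failed on this worker so that the worker can decline them for a while.
class RecentFailures {
  // jobId -> timestamp (ms) after which the job may be picked up again
  private readonly expiry = new Map<string, number>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  // Record that this worker just failed the given job.
  add(jobId: string): void {
    this.expiry.set(jobId, this.now() + this.ttlMs);
  }

  // True while the job is still on this worker's ignore list.
  shouldSkip(jobId: string): boolean {
    const until = this.expiry.get(jobId);
    if (until === undefined) return false;
    if (this.now() >= until) {
      this.expiry.delete(jobId); // entry expired, forget it
      return false;
    }
    return true;
  }

  // Current ignore list, i.e. the IDs that would have to be handed to
  // the job-fetching step on every call (or stored in a Redis key).
  ids(): string[] {
    const t = this.now();
    for (const [id, until] of this.expiry) {
      if (t >= until) this.expiry.delete(id);
    }
    return [...this.expiry.keys()];
  }
}
```

Giving each entry an expiry keeps the list bounded, and it means a job can still be retried by the same worker if no other worker picks it up in time. Even so, as noted earlier in the thread, this would reduce the chance of the same worker retrying a job rather than guarantee it never happens.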