taskforcesh / bullmq-pro-support

Support repository for BullMQ Pro edition.

How to ensure ordering of jobs in case of delayed retries. #68

Open ksyd9821 opened 6 months ago

ksyd9821 commented 6 months ago

Hello,

for our current use case, we are adding jobs to a queue and would like them to be successfully processed in the same order they were added (FIFO). So when a job fails, the desired outcome would be that the job is retried automatically after some delay before moving on to the next job on the queue.

For example, if we have these 3 jobs in our queue [1, 2, 3], with job 1 being the first added, here is what a possible execution would look like:

  1. Job 1 is processed and fails.
  2. Job 1 is retried after some delay, while jobs 2 and 3 keep waiting.
  3. Job 1 succeeds.
  4. Job 2 is processed next, and so on.

How can we achieve this with BullMQ?

Thank you in advance for the support!

manast commented 6 months ago

Thank you for your question. I am afraid that currently, when a job fails, the queue is not halted, so the other waiting jobs will be processed as soon as a worker is free. How critical is this case for you? Can you describe in more detail the whole scenario where this functionality is needed?
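
To make this concrete, here is a minimal sketch of the current behavior, assuming standard BullMQ retry options (`attempts` plus `backoff`); the queue name and connection settings are illustrative:

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// Jobs are added in FIFO order; on failure each job is retried up to
// 5 times with exponential backoff starting at 1 second.
const queue = new Queue('events', { connection });
await queue.add('event', { seq: 1 }, {
  attempts: 5,
  backoff: { type: 'exponential', delay: 1000 },
});

// Even with concurrency 1, a failed job is parked in the delayed set
// until its backoff expires, and the worker immediately picks up the
// next waiting job -- so FIFO is not preserved across retries.
const worker = new Worker('events', async (job) => { /* process job */ }, {
  connection,
  concurrency: 1,
});
```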

hardcodet commented 6 months ago

@manast

How critical is this case for you? can you develop a bit more the whole scenario where this functionality is needed?

It is critical, unfortunately. Our use case is a number of event queues for webhooks (each queue representing a customer's subscription), where we would like to deliver events in the proper order. We see that in practice webhooks sometimes fail (e.g. the customer's endpoint is temporarily unavailable) and need to be retried, but we can't have those events move to the back of the queue, because order matters.

As a dummy example: imagine two events occurring in this order:

  1. The system is offline
  2. The system is online

If we sent these events in inverted order, the outcome on the customer's end would be completely wrong: they would assume the system is offline and might cease communication with it.

manast commented 6 months ago

Ok, so this feature would be specific to groups: a group would not continue processing new jobs until the previous one has either completed or failed. Furthermore, this feature would only make sense with a concurrency of 1. We need to study how feasible it is in the current design.
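
For illustration, a minimal sketch of groups with a concurrency of 1 as they work today, using the BullMQ Pro `QueuePro`/`WorkerPro` API (names and connection settings are illustrative):

```ts
import { QueuePro, WorkerPro } from '@taskforcesh/bullmq-pro';

const connection = { host: 'localhost', port: 6379 };

const queue = new QueuePro('webhooks', { connection });

// One group per customer; within a group, jobs are picked in FIFO order.
await queue.add('event', { type: 'system.offline' }, { group: { id: 'customer-42' } });
await queue.add('event', { type: 'system.online' }, { group: { id: 'customer-42' } });

// With group concurrency 1, at most one job per group is active at a
// time, so order holds within the group -- except when a job fails and
// is delayed for a retry, which is exactly the case discussed here.
const worker = new WorkerPro('webhooks', async (job) => { /* deliver */ }, {
  connection,
  group: { concurrency: 1 },
});
```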

hardcodet commented 6 months ago

You're right. We're already using a concurrency of 1 extensively to enforce sequential processing, because there are a lot of cases for us that warrant it. Preserving order on retries is just one more flavor.

If it's a bad fit for BullMQ, I guess we could work around the issue with the following strategy:

This is absolutely feasible for us. We just figured that ordered processing (including retries with backoff delays) would be a common scenario, so we wanted to discuss this with you first 👍
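
(The strategy above isn't spelled out here, but one workaround in this spirit, purely as a sketch with a placeholder `deliverEvent`, is to retry inside the processor itself, so the job stays active and, with a concurrency of 1, later jobs cannot overtake it:)

```ts
import { Worker } from 'bullmq';
import { setTimeout as sleep } from 'node:timers/promises';

// Placeholder: POST the event to the customer's webhook endpoint.
async function deliverEvent(data: unknown): Promise<void> { /* ... */ }

const worker = new Worker('events', async (job) => {
  const maxAttempts = 5;
  for (let attempt = 1; ; attempt++) {
    try {
      return await deliverEvent(job.data);
    } catch (err) {
      if (attempt >= maxAttempts) throw err; // give up; the job fails for real
      await sleep(1000 * 2 ** attempt);      // in-process exponential backoff
    }
  }
}, {
  connection: { host: 'localhost', port: 6379 },
  concurrency: 1, // no other job from this queue runs while we retry
});
```

The cost is that the worker is blocked while it waits, and BullMQ keeps renewing the job's lock for as long as the processor runs.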

hardcodet commented 6 months ago

It wouldn't be specific to groups, though: we thought about creating a queue for each customer (rather than groups keyed by a customer ID), which would reduce the complexity of retries considerably compared to queues that would still have to process events for other groups.
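
As a rough sketch of that layout (queue names and helpers are illustrative), each customer would get its own queue and its own concurrency-1 worker:

```ts
import { Queue, Worker } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };

// One queue per customer subscription, so retries in one customer's
// stream never interact with any other customer's events.
function queueFor(customerId: string): Queue {
  return new Queue(`webhooks:${customerId}`, { connection });
}

function workerFor(customerId: string): Worker {
  return new Worker(
    `webhooks:${customerId}`,
    async (job) => { /* deliver job.data to this customer's endpoint */ },
    { connection, concurrency: 1 },
  );
}
```

The trade-off is one dedicated worker per customer rather than a shared worker pool.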

manast commented 6 months ago

We are working on a solution for this in BullMQ, and then we will extend it to groups as well. This is the PR: https://github.com/taskforcesh/bullmq/pull/2465

hardcodet commented 5 months ago

You guys rock! Looking forward to the implementation :)

Adam-Burke commented 2 months ago

Just wondering how this is progressing. I'm processing ordered sports facts from third parties and being able to block at the group level would be fantastic.

manast commented 2 months ago

@Adam-Burke yes, we have this PR almost ready. The biggest issue I see is that order still cannot be guaranteed as long as you have more than one worker: even though workers pick up jobs in order, network latencies and the like make it possible for a worker on one machine or process to start its job before another worker starts a job that was picked up earlier.

Adam-Burke commented 2 months ago

Could there be a way to ensure that jobs from the same group are always processed by the same worker (assuming it's still running)? You could then still scale out workers but have group-based, at-least-once, ordered processing.

Either way, I think it would still be quite useful for our purposes.
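
(Not a BullMQ feature, but one application-level approximation, sketched with illustrative names, is to shard groups across a fixed number of queues by hashing the group id, and to give each queue a single dedicated worker:)

```ts
import { createHash } from 'node:crypto';
import { Queue } from 'bullmq';

const connection = { host: 'localhost', port: 6379 };
const SHARDS = 4;

// Hash the group id to a stable shard so that a group's jobs always
// land on the same queue -- and therefore on the same worker.
function shardFor(groupId: string): number {
  return createHash('sha1').update(groupId).digest().readUInt32BE(0) % SHARDS;
}

const queues = Array.from(
  { length: SHARDS },
  (_, i) => new Queue(`events:${i}`, { connection }),
);

async function addGrouped(groupId: string, data: unknown): Promise<void> {
  await queues[shardFor(groupId)].add('event', data);
}
```

Each shard would then run one concurrency-1 worker, so per-group ordering holds as long as that shard's worker is the only consumer.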

manast commented 2 months ago

@Adam-Burke Let's see. If you used groups with a max concurrency of 1, then it is guaranteed that only one job is processed at a time per group, so order is guaranteed within a group, except for the failure-with-retries case. So if we supported this case (keeping order within a group across retries), would that solve your use case?

UPDATE: sorry for the confusion, now I see that this issue is exactly about that... so yes, basically we will support this case soon.

rnevet-reply commented 2 months ago

Hi, we are also facing this issue, and I see that the PR is in progress. Can someone estimate how much work/time is left on it? Can I be optimistic that this new feature will be available soon? A rough ETA would be super helpful. Thanks, and sorry for nagging!