sisve closed this issue 5 years ago
So what is the solution?
Just throwing ideas out there, can we use MULTI and EXEC?
Apparently not:
Seems like our only option would be to go full Lua script with it.
Besides documentation, as I mentioned here, the fix will need to use the BRPOPLPUSH
command. It will add another Redis list key for each queue, add one or two extra requests per pop, and reverse the order of jobs in the Redis list.
Please take a look at https://redis.io/commands/rpoplpush#pattern-reliable-queue
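For context, the reliable-queue pattern from that page can be sketched with redis-cli, using hypothetical key names (queues:default:processing is an assumed name, not something Laravel uses):

```shell
# Atomically move a job from the tail of the queue onto a processing
# list, blocking for up to 5 seconds if the queue is empty:
redis-cli BRPOPLPUSH queues:default queues:default:processing 5

# ... process the returned job ...

# On success, delete the job from the processing list:
redis-cli LREM queues:default:processing 1 '<job payload>'
```

If the worker dies before the LREM, the job survives in the processing list and can be re-queued by a monitor process. Note that BRPOPLPUSH pops from the tail, which is why adopting it reverses the job order relative to the current BLPOP-based approach.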
Ah, hmm, we may need to back this out then if we can't make it reliable. Any thoughts? If we need to use BRPOPLPUSH,
that is fine with me. But it sounds like reversing the order of jobs in the Redis list is a major breaking change? Or we could document that changing from non-blocking to blocking requires you to drain your queues first so the order is correct.
I personally feel like data loss on a queue is going to be critical in basically 99% of applications.
If we change the ordering of the queues we should probably change both the blocking and the non-blocking order. I don't think that we should require people to drain their queues just to try out the new features.
I don't see a huge problem with changing the order of the lists between 5.5 and 5.6. Sure, I can make up some scenarios where it will cause problems, but those involve a mix of Laravel versions pushing and popping to the queue, and not enough workers to ever drain it. And that is a totally different problem anyway.
I think it's enough to document that we change the order of the redis queue in the upgrade guide. It's an implementation detail and only affects people if they have jobs in the queue when upgrading.
Helper scripts can be provided to do the migration and reverse the orders.
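A minimal sketch of such a migration script, assuming the default queue key name and a made-up temp key, done server-side so the reversal is atomic:

```shell
# Reverse queues:default in place: RPOP each job from the tail and
# RPUSH it onto a temp list (which reverses the order), then rename
# the temp list over the original.
redis-cli EVAL '
  local src, tmp = KEYS[1], KEYS[2]
  while true do
    local v = redis.call("rpop", src)
    if v == false then break end
    redis.call("rpush", tmp, v)
  end
  if redis.call("exists", tmp) == 1 then
    redis.call("rename", tmp, src)
  end
' 2 queues:default queues:default:tmp
```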
Looks like I'll just need to document this for 5.6 release I guess.
Added this warning for now:
@halaei do you intend to work on fixing the potential data loss issue or should others try to take a look at it?
I work on it and send a PR in 2 days.
An update:
the config/queue.php
files need the blocking_for entry added, and there is no explicit mention of this in the upgrade guide. All three of these factors together become a growing problem of reported lost/paused jobs in Horizon.
On an unrelated note, if we're using blocking_for=0 (or any high value), how does that work with the SIGTERM handling to cleanly shut down a worker? I imagine the signal is sent but isn't processed until the BLPOP returns. If that takes too long, supervisor will have issued a SIGKILL to forcefully shut down the worker.
There are several questions about the current implementation, but there is a new PR that uses BRPOPLPUSH at https://github.com/laravel/framework/pull/23057 that may fix this. I'm not knowledgeable enough about BRPOPLPUSH to evaluate that approach, do we have any takers that can review that PR and add some input?
I wrote a Redis module to fix this issue and add some other values. Please check lqrm and its php driver. Here is an intro digest:
This is a Redis module that implements a Laravel-compatible queue driver for Redis, with the following improvements over the original Laravel 5.7 PHP-Lua driver:
- Blocking pop is now more reliable than before.
- Blocking pop now works on delayed and reserved jobs as well.
- The timer for delayed and reserved jobs is server-side, with millisecond precision. This means you don't need to worry about syncing your PHP and Redis servers in case your project is distributed across different servers. Moreover, this makes the retry_after and block_for configurations independent of each other.
- Laravel queue can now be available for other programming languages and frameworks as well. Feel free to port it to your favorite ones.
It would be great if you could give it a try and/or send feedback; it might still have some bugs, though.
There is a pretty simple solution that I think will work. It's explained in the Redis docs here.
Basically when we push a job we also push a notification key:
MULTI
RPUSH queues:default '{"id":12345,"attempts":1,"job":"say_hello","data":{"name":"Matt"}}'
RPUSH queues:default:notify x
EXEC
The worker does a blocking pop on the notification key. Once the notification key returns an element we know a job is ready. Once a job is ready we use the existing Lua script and get all of the same atomicity guarantees.
LOOP forever
    WHILE EVAL(pop.lua, queues:default, queues:default:reserved, expiration) returns elements
        ... process elements ...
    END
    BRPOP queues:default:notify
END
In this example, queues:default:notify is a new list used only for notifying workers that a new job was pushed. There would be a new list for each queue, using the pattern queues:{{name}}:notify.
There is a chance the worker might crash between popping the notification element and popping the job onto the reserved queue. That is ok - once the worker restarts it will attempt to pop the job again and nothing important is lost.
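For illustration, the atomic pop that the loop above relies on could look roughly like this. This is a sketch only, not Laravel's actual pop.lua; the key names and the expiration score are assumptions:

```shell
redis-cli EVAL '
  -- Pop the next job and, within the same atomic script, record it in
  -- the reserved sorted set scored by its expiration time. A crash can
  -- no longer lose a job between the pop and the reservation.
  local job = redis.call("lpop", KEYS[1])
  if job ~= false then
    redis.call("zadd", KEYS[2], ARGV[1], job)
  end
  return job
' 2 queues:default queues:default:reserved 1600000000
```

Because the script runs entirely inside Redis, no worker failure can interleave between the LPOP and the ZADD.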
The only downside I can think of with this approach is that it will take some care when upgrading. If you update the workers before the producers, they will block waiting for a notification key that is never set. The worker will eventually work the job once block_for
times out, so job processing won't stop completely. I think in most cases you would upgrade all of your servers at the same time, so this would be a rare edge case.
This will be fixed in 5.8: https://github.com/laravel/framework/pull/27336
Thanks for everyone's help with this!
Description:
The new Redis blocking pop functionality introduced in https://github.com/laravel/framework/pull/22284 has issues if there are problems between the BLPOP and the ZADD. Any failure in the worker at this point will lose the jobs that have been popped from the queue; these jobs will not be retried by another worker, since they were never pushed to the reserved queue.
The problematic code is in RedisQueue::blockingPop.
https://github.com/laravel/framework/blob/63b61080b1383f8570526f85caf09dfea24c146d/src/Illuminate/Queue/RedisQueue.php#L225-L241