sisve closed this issue 5 years ago
So what is the solution?
Just throwing ideas out there, can we use MULTI and EXEC?
Apparently not:
Seems like our only option would be to go full Lua script with it.
Besides documentation, as I mentioned here, the fix will need to use the BRPOPLPUSH
command. It will add another Redis list key for each queue, add one or two extra requests per pop, and reverse the order of jobs in the Redis list.
Please take a look at https://redis.io/commands/rpoplpush#pattern-reliable-queue
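For context, the reliable-queue pattern from that page can be sketched with redis-cli, using hypothetical key names (queues:default:processing is an assumed name, not something Laravel uses):

```shell
# Atomically move a job from the tail of the queue onto a processing
# list, blocking for up to 5 seconds if the queue is empty:
redis-cli BRPOPLPUSH queues:default queues:default:processing 5

# ... process the returned job ...

# On success, delete the job from the processing list:
redis-cli LREM queues:default:processing 1 '<job payload>'
```

If the worker dies before the LREM, the job survives in the processing list and can be re-queued by a monitor process. Note that BRPOPLPUSH pops from the tail, which is why adopting it reverses the job order relative to the current BLPOP-based approach.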
Ah, hmm, we may need to back this out then if we can't make it reliable. Any thoughts? If we need to use BRPOPLPUSH,
that is fine with me. But it sounds like reversing the order of jobs in the Redis list is a major breaking change? Or we could document that changing from non-blocking to blocking requires you to drain your queues first so the order is correct.
I personally feel like data loss on a queue is going to be critical in basically 99% of applications.
If we change the ordering of the queues we should probably change both the blocking and the non-blocking order. I don't think that we should require people to drain their queues just to try out the new features.
I don't see a huge problem with changing the order of the lists between 5.5 and 5.6. Sure, I can make up some scenarios where it will cause problems, but those involve a mix of Laravel versions pushing and popping to the queue, and not enough workers to ever drain it. And that is a totally different problem anyway.
I think it's enough to document that we change the order of the redis queue in the upgrade guide. It's an implementation detail and only affects people if they have jobs in the queue when upgrading.
Helper scripts can be provided to do the migration and reverse the orders.
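A minimal sketch of such a migration script, assuming the default queue key name and a made-up temp key, done server-side so the reversal is atomic:

```shell
# Reverse queues:default in place: RPOP each job from the tail and
# RPUSH it onto a temp list (which reverses the order), then rename
# the temp list over the original.
redis-cli EVAL '
  local src, tmp = KEYS[1], KEYS[2]
  while true do
    local v = redis.call("rpop", src)
    if v == false then break end
    redis.call("rpush", tmp, v)
  end
  if redis.call("exists", tmp) == 1 then
    redis.call("rename", tmp, src)
  end
' 2 queues:default queues:default:tmp
```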
Looks like I'll just need to document this for 5.6 release I guess.
Added this warning for now:
@halaei do you intend to work on fixing the potential data loss issue or should others try to take a look at it?
I work on it and send a PR in 2 days.
An update:
the config/queue.php
files need the blocking_for entry added, and there is no explicit mention of this in the upgrade guide. All three of these factors together become a growing problem of reported lost/paused jobs in Horizon.
On an unrelated note, if we're using blocking_for=0 (or any high value), how does that work with the SIGTERM handling to cleanly shut down a worker? I imagine the signal is sent but isn't processed until the BLPOP returns. If that takes too long, supervisor will have issued a SIGKILL to forcefully shut down the worker.
There are several questions about the current implementation, but there is a new PR that uses BRPOPLPUSH at https://github.com/laravel/framework/pull/23057 that may fix this. I'm not knowledgeable enough about BRPOPLPUSH to evaluate that approach, do we have any takers that can review that PR and add some input?
I wrote a Redis module to fix this issue and add some other values. Please check lqrm and its php driver. Here is an intro digest:
This is a Redis module that implements a Laravel-compatible queue driver for Redis, with the following improvements over the original Laravel 5.7 PHP-Lua driver:
- Blocking pop is now more reliable than before.
- Blocking pop now works on delayed and reserved jobs as well.
- The timer for delayed and reserved jobs is server-side, with millisecond precision. This means you don't need to worry about syncing your PHP and Redis servers in case your project is distributed across different servers. Moreover, this makes the retry_after and block_for configurations independent of each other.
- Laravel queue can now be available for other programming languages and frameworks as well. Feel free to port it to your favorite ones.
It would be great if you could give it a try and/or send feedback; it might still have some bugs, though.
There is a pretty simple solution that I think will work. It's explained in the Redis docs here.
Basically when we push a job we also push a notification key:
MULTI
RPUSH queues:default '{"id":12345,"attempts":1,"job":"say_hello","data":{"name":"Matt"}}'
RPUSH queues:default:notify x
EXEC
The worker does a blocking pop on the notification key. Once the notification key returns an element we know a job is ready. Once a job is ready we use the existing Lua script and get all of the same atomicity guarantees.
LOOP forever
    WHILE EVAL(pop.lua, queues:default, queues:default:reserved, expiration) returns elements
        ... process elements ...
    END
    BRPOP queues:default:notify
END
In this example, queues:default:notify is a new list used only for notifying workers that a new job was pushed. There would be a new list for each queue, using the pattern queues:{{name}}:notify.
There is a chance the worker might crash between popping the notification element and popping the job onto the reserved queue. That is ok - once the worker restarts it will attempt to pop the job again and nothing important is lost.
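For illustration, the atomic pop that the loop above relies on could look roughly like this. This is a sketch only, not Laravel's actual pop.lua; the key names and the expiration score are assumptions:

```shell
redis-cli EVAL '
  -- Pop the next job and, within the same atomic script, record it in
  -- the reserved sorted set scored by its expiration time. A crash can
  -- no longer lose a job between the pop and the reservation.
  local job = redis.call("lpop", KEYS[1])
  if job ~= false then
    redis.call("zadd", KEYS[2], ARGV[1], job)
  end
  return job
' 2 queues:default queues:default:reserved 1600000000
```

Because the script runs entirely inside Redis, no worker failure can interleave between the LPOP and the ZADD.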
The only downside I can think of with this approach is that it will take some care when upgrading. If you update the workers before the producers, they will block waiting for a notification key that is never set. The worker will eventually work the job once block_for
times out, so job processing won't stop completely. I think in most cases you would upgrade all of your servers at the same time, so this would be a rare edge case.
This will be fixed in 5.8: https://github.com/laravel/framework/pull/27336
Thanks for everyone's help with this!
Description:
The new Redis blocking pop functionality introduced in https://github.com/laravel/framework/pull/22284 has issues if there are problems between the BLPOP and the ZADD. Any failure in the worker at this point will lose the jobs that have been popped from the queue; these jobs will not be retried by another worker, since they were never pushed to the reserved queue.
The problematic code is in RedisQueue::blockingPop.
https://github.com/laravel/framework/blob/63b61080b1383f8570526f85caf09dfea24c146d/src/Illuminate/Queue/RedisQueue.php#L225-L241