laravel / framework

The Laravel Framework.
https://laravel.com
MIT License
32.21k stars 10.9k forks source link

Exit signal handing in queue:listen not working #38116

Closed emil-nasso closed 3 years ago

emil-nasso commented 3 years ago

Description:

We have multiple "microservice style" applications that run laravel. Most of them have a queue and uses artisan queue:work or artisan queue:listen to process jobs in the queue.

We have defaulted to queue:work but due to some circumstances in some of the applications (legacy code and more...) we need to use queue:listen for some, due to how queue:work reuses the same process for all jobs.

We run this in docker containers but have started noticing that exit signals are not handled the same in queue:listen as it is in queue:work. When upgrading we shut down the containers by sending a SIGTERM to queue:work which will handle the shutdown gracefully by completing any current jobs and shutting down. In case some job is stuck for some reason, there is a timeout in place that "kills" the container if queue:work is taking too long to shut down.

This works great with queue:work. The upgrades and shotdowns are normally very fast and responsive.

We are not experiencing this with queue:listen however. It seems like queue:listen doesn't handle the exit signals at all. It doesn't respond to SIGTERM by finishing current jobs and shutting down. Instead the timeout passes and the containers are killed.

The result is that upgrades are performers in seconds (or a few minutes) with queue:work while the timeout determines how long a upgrade can take for queue:listen. We are also at risk of having jobs being executed be killed in the middle of the execution when running queue:listen. With some of our applications an upgrade goes from taking 20 seconds to perform with queue:work to 30+ minutes with queue:listen .

Is there any reason for this behavior, or any limitations that I don't know of or should it be regarded as a bug?

I have created a repository with a reproduction of this problem. I installed a new laravel installation today and committed it to the repo. I then added two docker-files and a bash-script to the second commit. The repository is available here: https://github.com/emil-nasso/laravel-exit-signals-reprod

The commit with the reproduction: https://github.com/emil-nasso/laravel-exit-signals-reprod/commit/7cfce751158744937141cf64f4d518319bd828eb

Running the reproduction should only require:

git clone git@github.com:emil-nasso/laravel-exit-signals-reprod.git
cd laravel-exit-signals-reprod
./run-test.bash

The script will build both docker images, start two containers and then stop them, while displaying timing information. The only difference between the two docker images is that one runs queue:work as the entrypoint and the other runs queue:listen.

You will notice (hopefully) that the queue:work container shuts down instantly (at least very fast...) while the queue:listen containers take ten seconds to shut down. Ten seconds is the default timeout for docker stop.

image

Please let me know if there is any other information I can provide or if the reproduction doesn't work for you.

derekmd commented 3 years ago

AFAIK queue:listen is mostly a local dev tool for synchronous queue processing. Maybe it's OK for production when every command is idempotent with database transactions and doesn't use external services? The possibility for duplicate sent emails, delivered text messages, broadcast event listeners, etc. make running it in production a non-starter is most apps.

It seems like queue:listen doesn't handle the exit signals at all. ... We are also at risk of having jobs being executed be killed in the middle of the execution when running queue:listen.

https://github.com/laravel/framework/blob/8.x/src/Illuminate/Queue/Worker.php#L676-L688

That SIGTERM event registration allows queue:work to finish the current command being processed and exit gracefully after completion. queue:listen doesn't have this framework-defined async handling. So SIGTERM on the latter will defer OS process exit behavior to the PHP binary running the script. Likely it'll exit immediately. Or longer if PHP is waiting on file I/O or a network stream?

The Laravel docs paragraph on queue:listen warn its focus isn't on worker restarts or queue performance.


we need to use queue:listen for some [jobs], due to how queue:work reuses the same process for all jobs.

For queue:work, as https://laravel.com/docs/8.x/queues#resource-considerations covers you may need to make each one of your job classes cleanup after themselves so the same Laravel Application can run many jobs.

public function handle()
{
    try {
        // job code
    } finally {
        // clear facade instances, app singletons, circular references,
        // or shared/conflicting global resources that can't be carried into the next job
    }
}
driesvints commented 3 years ago

Thanks @derekmd

emil-nasso commented 3 years ago

AFAIK queue:listen is mostly a local dev tool for synchronous queue processing. Maybe it's OK for production when every command is idempotent with database transactions and doesn't use external services? The possibility for duplicate sent emails, delivered text messages, broadcast event listeners, etc. make running it in production a non-starter is most apps.

Thanks for the response.

Can you go a bit more in-depth on this?

I find very little information about the purpose of queue:listen. I find nothing in the documentation that indicates that it shouldn't be used in production.

Why would it be a problem to use it with external services or why would there be a risk of sending duplicate emails, compared to queue:work ?

derekmd commented 3 years ago

Example:

  1. queue:listen grabs a mailable from the Redis queue
  2. the child process mailable calls an external API to send the email
  3. terminal CTRL+C or deploy/Docker-triggered SIGTERM will end the queue:listen instance and its mailable child process
  4. the mailable queue item is in the Redis :reserved queue and Laravel's JobProcessed event hasn't been dispatched so the job isn't completed
  5. if you manually replay that mailable job (through Laravel Horizon, etc.), the same email will be sent

You wouldn't have to worry about this with an external API like Stripe that has an Idempotency-Key parameter to ensure duplicate requests aren't accepted (important for payments): https://stripe.com/docs/api/idempotent_requests

Likewise you'd see issues if database transactions aren't used. Updating 3 models in a job, one row may be changed but with SIGTERM prematurely reached, the other two queries haven't been run. So you'd need a database transaction to commit after all queries have been executed.

For above, queue:work wouldn't kill the mailable job mid-run. The whole job lifecycle including the JobProcessed event will be run and then the command exits.

I find nothing in the documentation that indicates that it shouldn't be used in production.

emil-nasso commented 3 years ago

@derekmd Thanks for the insight, very helpful.

The example issues all comes down to queue:listen not handing signals "gracefully".

Has there ever been any discussion about implementing signal handling for queue:listen as it is implemented in queue:work? Maybe this issue should be regarded as a feature request instead?

I will make a pull-request to update the documentation and I will reference this issue there. I will most likely be needing some feedback on the phrasing in the documentation. :)