laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon
MIT License

Horizon can restart gracefully but cannot stop gracefully #663

Closed fideloper closed 4 years ago

fideloper commented 5 years ago

Hi!

I have an issue where I need to be able to stop Horizon gracefully.

Currently it works great for restarts (which you would want after a deploy), but stopping Horizon gracefully is an issue due to how Supervisord and Systemd work.

Restarting Works Great

Supervisord / Systemd is usually set to restart Horizon when it stops (even if it stops with a non-zero exit code). This works great for restarts, as we can use horizon:terminate and/or horizon:terminate --wait to have it gracefully stop workers, knowing the process monitor will start Horizon back up.

However, horizon:terminate is async. It returns immediately (before the workers are finished). This means that any follow up command is run immediately.

This is an issue for stopping Horizon.

Stopping is Not Great™

Supervisord can only send signals to stop a process. A SIGTERM to Horizon stops everything immediately - it's not graceful. So short of listening for an alternative signal, or changing Horizon so SIGTERM is graceful, Supervisord has no way to stop Horizon gracefully.

Systemd has an ExecStop option that could run artisan horizon:terminate. However, since horizon:terminate is async, Systemd sends a SIGTERM as soon as the ExecStop command returns and kills Horizon too soon.
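A naive unit file illustrates the problem (a sketch; paths hypothetical):

```ini
[Service]
ExecStart=/usr/bin/php artisan horizon
# horizon:terminate returns before the workers finish, so systemd's
# follow-up SIGTERM lands while jobs are still running:
ExecStop=/usr/bin/php artisan horizon:terminate
```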

Why?

This is tough on long-running processes that can't necessarily be killed and retried (think ChipperCI running builds that can last anywhere from a few minutes to an hour).

The use case I have specifically is using AutoScaling to create and terminate servers dynamically. Terminating a server with a running job would definitely cut off someone's build.

Another option of course is re-architecting how jobs are run so they're split up into more idempotent tasks. We'll cross that bridge if we need as it would be a pretty big task.

Possible Solutions

I have two ideas:

  1. The master process (the one that Supervisord/Systemd monitors) can listen for an alternative signal or change what SIGTERM does to shut down gracefully. This would work for any process monitor.
  2. The horizon:terminate command can include a --hang (or similar) flag to wait until workers are shut down (instead of being async and returning immediately). This would allow Systemd to use ExecStop.
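The second idea could be sketched in shell: instead of returning immediately, the command would signal the master and then block until the process is actually gone. This is an illustration only; a background sleep stands in for the Horizon master process:

```shell
#!/usr/bin/env sh
# Sketch of the proposed blocking ("--hang") behavior: request a stop,
# then wait until the process has actually exited before returning.
# A dummy "sleep" stands in for the Horizon master process here.
sleep 30 &
pid=$!

kill -TERM "$pid"        # graceful stop request
wait "$pid" 2>/dev/null  # block until the process has really exited
echo "process $pid has exited"
```

With this shape, an ExecStop (or a deploy script) would not return until the workers are done.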

I'm willing to help with a PR for any solution, but I'll need some direction in where to find some code due to Horizon's complexity. I'm guessing this may involve some polling to figure out the state of each worker, but I'm not 100% clear on that.

A Hacky Solution

If the above ideas are too complex (I can certainly see that being the case), I have a hacky solution that works in Systemd (there's no equivalent in Supervisord):

File (in Ubuntu 18.04): /lib/systemd/system/myjobs.service.

[Unit]
Description=Long Running Job Worker

[Service]
User=forge
WorkingDirectory=/home/forge/myapp.com

# Restart "always" so we can run horizon:terminate during deploys
Restart=always
ExecStart=/usr/bin/php artisan horizon --environment=production

# Allow graceful stop/restart by running a command
# to stop, rather than sending a SIGTERM signal
ExecStop=/usr/bin/php artisan horizon:terminate --wait

# This signal tells Horizon to "unpause". We're using it
# as a hack so Systemd doesn't send a SIGTERM prematurely
KillSignal=SIGCONT

# A long (1 hour) timeout to give jobs time to finish
TimeoutStopSec=3600

[Install]
WantedBy=multi-user.target

Use the following to see it working:

sudo systemctl enable myjobs
sudo systemctl start myjobs
sudo systemctl status myjobs
# run a long-running job and half-way through, run:
sudo systemctl stop myjobs
# It should hang until the workers stop gracefully

Let me know...

If you have other ideas or think I'm misunderstanding something, definitely let me know!

driesvints commented 5 years ago

This sounds like a good idea. I think @themsaid is the best person to guide you in the right direction for this.

fideloper commented 5 years ago

Looking into this a bit more, it looks like the signals listened to by Horizon are: https://github.com/laravel/horizon/blob/master/src/ListensForSignals.php

Via the ListensForSignals trait, handlers are registered within the artisan horizon console command.

In all cases:

  1. Ctrl+C sends a SIGINT, which terminates Horizon immediately (I don't see this signal being specifically handled in the code, but, as with most programs, it works to kill Horizon)
  2. A SIGTERM always appears to attempt a graceful stop (horizon:terminate sends a SIGTERM to each Master Process).
    • However, it looks like when the parent process receives a SIGTERM, Horizon should stop gracefully but does not.
    • Supervisord (and other process monitors) send SIGTERM by default, but for Horizon this appears to kill it before the child processes can stop gracefully

This may be something that needs configuration within supervisord / systemd - a timeout to wait for the process to exit when a SIGTERM is sent.

However I wonder if it can be made to "hang" and not asynchronously terminate when horizon:terminate is used, and/or when a SIGTERM is sent to the parent process.

prestamodule commented 4 years ago

We are also facing issues when using horizon:terminate (--wait is enabled by default, since fast_termination is set to false in the horizon configuration).

We are using SSH connections in our jobs, and horizon:terminate does not wait for the jobs to finish (jobs fail because the connection is lost) :(

We are using supervisor to manage horizon master process.

graemlourens commented 4 years ago

Same here, this is a real problem for us, as we are using Kubernetes to run our queues (multiple Horizon instances on several nodes with several pods each), and running horizon:terminate is not an option (it would cause ALL Horizon instances to stop). The rolling update strategy sends a SIGTERM (configurable) to each Horizon instance (pod) separately and waits for it to terminate/exit before spawning new pods/instances.

It would be phenomenal if the usual SIGTERM to a horizon process would cause it to gracefully stop (with all workers) before exiting. Without this, controlled rolling updates via Kubernetes are not possible, or we are missing an obvious way to handle this.
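For what it's worth, if a plain SIGTERM did stop Horizon gracefully (with all workers), the remaining knob on the Kubernetes side would just be the pod's termination grace period before the follow-up SIGKILL. A sketch (image and names hypothetical):

```yaml
spec:
  # Give long-running jobs up to an hour between SIGTERM and SIGKILL
  terminationGracePeriodSeconds: 3600
  containers:
    - name: horizon
      image: myapp:latest
      command: ["php", "artisan", "horizon"]
```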

driesvints commented 4 years ago

@graemlourens if you're up to it, you're free to send in a PR for this.

fideloper commented 4 years ago

If @themsaid has time, I’d love some clarification on if I’m correct based on looking at the code:

In order for Horizon to stop gracefully (rather than just restart gracefully), would we need each worker to be polled by the master to see if it’s completely finished its jobs? Could each worker report being finished using redis pubsub?

That way we could have Horizon’s master process “hang” when stopping until all workers are complete/stopped.

We may also need to see if a new command or flag on horizon:terminate is necessary (or if we can change what a sigterm to the master process does).

graemlourens commented 4 years ago

@driesvints would it be possible to get the 'hint' @fideloper needs to be able to continue analyzing this ticket? It seems to me that graceful stopping should be an important part of Horizon, as queues are now the backbone of many crucial system parts.

Currently I'm not able to provide PRs, but I'm happy to invest time into testing scenarios and reliability if there is a proposal.

driesvints commented 4 years ago

@graemlourens I don't have time to look at this myself atm but I'll try to ping @themsaid.

themsaid commented 4 years ago

When SIGTERM is sent to the process running php artisan horizon it's killed gracefully, so for example a simple kill PID sends the signal and Horizon waits until all workers are done and then terminates.

I'm not sure if I'm getting this wrong but it seems to already stop gracefully. @fideloper?

mfn commented 4 years ago

When SIGTERM is sent to the process running php artisan horizon it's killed gracefully, so for example a simple kill PID sends the signal and Horizon waits until all workers are done and then terminates.

But supervisor, the recommended process manager, will immediately start it afterwards, won't it?

And I think herein lies an another issue together with supervisor:

I believe the supervisor config stopwaitsecs is involved:

The number of seconds to wait for the OS to return a SIGCHLD to supervisord after the program has been sent a stopsignal. If this number of seconds elapses before supervisord receives a SIGCHLD from the process, supervisord will attempt to kill it with a final SIGKILL.

Or am I wrong?

themsaid commented 4 years ago

The issue, as I understand it, is that SIGTERM isn't handled correctly by Horizon to gracefully stop the process. That is not true.

stopwaitsecs could be involved in this yes.

fideloper commented 4 years ago

Phew, apologies for wasting time and thanks for the clarification.

While the systemd hack works (to get around systemd choosing to send a SIGKILL immediately), supervisorctl DOES work as @themsaid described (yay!)

The issue on my end was stopwaitsecs was set to the default 10 seconds and my queue jobs run anywhere from a few minutes to an hour.

Setting stopwaitsecs=3600 (1 hour in seconds) is appropriate in my case.
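For anyone landing here with the same symptom, the relevant supervisord fragment would look something like this (a sketch; program name and paths hypothetical):

```ini
[program:horizon]
command=php /home/forge/myapp.com/artisan horizon
user=forge
autostart=true
autorestart=true
; Wait up to an hour after the stop signal (SIGTERM by default)
; before supervisord gives up and sends SIGKILL
stopwaitsecs=3600
```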

mfn commented 4 years ago

I think it's not uncommon for developers to be introduced to Supervisor for the first time because of Horizon.

AFAICS the official docs at https://laravel.com/docs/6.x/horizon don't mention this. I mean I get it, they can't cover every possible case, but pointing towards stopwaitsecs might be useful there…

themsaid commented 4 years ago

We're adding this to the docs :)

driesvints commented 4 years ago

Thanks everyone!

kenny-lee-1992 commented 3 years ago

Thank you so much