Closed: fideloper closed this issue 4 years ago.
This sounds like a good idea. I think @themsaid is the best person to guide you in the right direction for this.
Looking into this a bit more, it looks like the signals listened to by Horizon are here: https://github.com/laravel/horizon/blob/master/src/ListensForSignals.php (via the `ListensForSignals` trait, within the `artisan horizon` console command).

In all cases:

- `SIGINT` terminates Horizon immediately (I don't see this signal being specifically handled in the code, but as with most programs, this works to kill Horizon).
- `SIGTERM` appears to always attempt to kill gracefully (`horizon:terminate` sends a `SIGTERM` to each master process).

So when sent a `SIGTERM`, Horizon should stop gracefully, but it does not. Supervisord / systemd send `SIGTERM` by default, but for Horizon, this appears to kill it before the worker processes can stop gracefully. This may be something that needs configuration within supervisord / systemd: a timeout to wait for the process to exit when a `SIGTERM` is sent.
However, I wonder if it can be made to "hang" and not asynchronously terminate when `horizon:terminate` is used, and/or when a `SIGTERM` is sent to the parent process.
We are also facing issues when using `horizon:terminate` (wait is enabled by default, as `fast_termination` is set to `false` in the Horizon configuration). We use SSH connections in our jobs, and `horizon:terminate` does not wait for the processes to terminate (jobs fail because the connection is lost) :(
We use supervisor to manage the Horizon master process.
Same here, this is a real problem for us, as we are using Kubernetes for running our queues (multiple Horizon instances on several nodes with several pods each), and running horizon:terminate is not an option (this would cause ALL Horizon instances to stop). The rolling-update strategy sends SIGTERM signals (this is configurable) to each instance of Horizon (pod) separately and waits for it to terminate/exit before spawning new pods/instances.
It would be phenomenal if the usual SIGTERM to a horizon process would cause it to gracefully stop (with all workers) before exiting. Without this, controlled rolling updates via Kubernetes are not possible, or we are missing an obvious way to handle this.
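For context on the Kubernetes side of this: the time Kubernetes waits between sending `SIGTERM` and force-killing a pod is configurable per pod. Assuming Horizon handled `SIGTERM` gracefully, a pod spec fragment along these lines (image and names are placeholders, not from this thread) would give long-running jobs time to drain:

```yaml
# Illustrative pod spec fragment; names/image are placeholders.
# Kubernetes sends SIGTERM on shutdown, then SIGKILL once the
# grace period below expires.
spec:
  terminationGracePeriodSeconds: 3600   # allow up to 1h for jobs to finish
  containers:
    - name: horizon
      image: my-app:latest
      command: ["php", "artisan", "horizon"]
```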
@graemlourens if you're up to it, you're free to send in a PR for this.
If @themsaid has time, I’d love some clarification on if I’m correct based on looking at the code:
In order for Horizon to stop gracefully (rather than just restart gracefully), would we need each worker to be polled by the master to see if it’s completely finished its jobs? Could each worker report being finished using redis pubsub?
That way we could have Horizon’s master process “hang” when stopping until all workers are complete/stopped.
We may also need to see if a new command or flag on `horizon:terminate` is necessary (or if we can change what a `SIGTERM` to the master process does).
@driesvints would it be possible to get the 'hint' required by @fideloper to be able to continue the analysis of this ticket? It seems to me that stopping gracefully should be an important part of Horizon, as queues are now the backbone of many crucial system parts.
Currently I'm not able to provide PRs, but I'm happy to invest time into testing scenarios and reliability in case there is a proposal.
@graemlourens I don't have time to look at this myself atm but I'll try to ping @themsaid.
When `SIGTERM` is sent to the process running `php artisan horizon`, it's killed gracefully: for example, a simple `kill PID` sends the signal, and Horizon waits until all workers are done and then terminates.
I'm not sure if I'm getting this wrong but it seems to already stop gracefully. @fideloper?
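As an aside, the drain-on-`SIGTERM` behaviour described here is easy to reproduce with a toy process. The sketch below is plain `sh`, not Horizon: the child traps `TERM`, "drains" (here just an `echo`), and exits 0, which is the shape of what the Horizon master does when you `kill PID`.

```shell
# Toy analogue of a master process that drains on SIGTERM (not Horizon itself).
result=$(
  sh -c 'trap "echo graceful; exit 0" TERM; sleep 30 >/dev/null & wait' &
  pid=$!
  sleep 1            # give the child time to install its trap
  kill -TERM "$pid"  # the equivalent of `kill PID` on the horizon master
  wait "$pid"
)
echo "$result"       # prints "graceful"
```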
> When `SIGTERM` is sent to the process running `php artisan horizon` it's killed gracefully, so for example a simple `kill PID` sends the signal and Horizon waits until all workers are done and then terminates.
But supervisor, the recommended process manager, will immediately start it afterwards, won't it?
And I think herein lies another issue, together with supervisor: if the process does not exit in time, supervisor will `SIGKILL` your process. I believe the supervisor config option `stopwaitsecs` is involved:

> The number of seconds to wait for the OS to return a SIGCHLD to supervisord after the program has been sent a stopsignal. If this number of seconds elapses before supervisord receives a SIGCHLD from the process, supervisord will attempt to kill it with a final SIGKILL.
Or am I wrong?
The issue, as I understand it, is that `SIGTERM` isn't handled correctly by Horizon to gracefully kill the process, which is not true. `stopwaitsecs` could be involved in this, yes.
Phew, apologies for wasting time and thanks for the clarification.
While the `systemd` hack works (to get around systemd choosing to send a `SIGKILL` immediately), `supervisorctl` DOES work as @themsaid described (yay!)

The issue on my end was that `stopwaitsecs` was set to the default 10 seconds, and my queue jobs run anywhere from a few minutes to an hour. Setting `stopwaitsecs=3600` (1 hour in seconds) is appropriate in my case.
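Putting that together, a supervisor program block along these lines is the shape of the fix (the program name and paths here are placeholders, not from this thread):

```ini
[program:horizon]
; Placeholder paths; adjust to your install.
command=php /var/www/app/artisan horizon
autostart=true
autorestart=true
; Give the SIGTERM'd master up to an hour to finish in-flight jobs
; before supervisord escalates to SIGKILL.
stopwaitsecs=3600
```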
I think it's not uncommon for developers to be introduced to supervisor for the first time because of Horizon.
AFAICS the official docs at https://laravel.com/docs/6.x/horizon don't mention this. I mean, I get it, they can't cover every possible case, but pointing towards `stopwaitsecs` might be useful there…
We're adding this to the docs :)
Thanks everyone!
Thank you so much
Hi!
I have an issue where I need to be able to stop Horizon gracefully.
Currently it works great for restarts (which you would want after a deploy), but stopping Horizon gracefully is an issue due to how Supervisord and Systemd work.
Restarting Works Great
Supervisord / Systemd is usually set to restart Horizon when it stops (even if it stops with a non-zero exit code). This works great for restarts, as we can use `horizon:terminate` and/or `horizon:terminate --wait` to have it gracefully stop workers, knowing the process monitor will start Horizon back up.

However, `horizon:terminate` is async. It returns immediately (before the workers are finished). This means that any follow-up command is run immediately.

This is an issue for stopping Horizon.
Stopping is Not Great™
Supervisord can only send signals to stop a process. A `SIGTERM` to Horizon stops everything immediately; it's not graceful. So short of listening for an alternative signal, or changing Horizon so `SIGTERM` is graceful, Supervisord has no way to stop Horizon gracefully.

Systemd has an `ExecStop` option that could run `artisan horizon:terminate`. However, Systemd sends a `SIGTERM` immediately after the `ExecStop` command runs. Since `horizon:terminate` is async, Systemd sends a `SIGTERM` immediately and kills Horizon too soon.

Why?
This is tough on long-running processes that can't necessarily be killed and tried again (think ChipperCI running builds that can last anywhere from a few minutes to an hour).
The use case I have specifically is using AutoScaling to create and terminate servers dynamically. Terminating a server with a running job would definitely cut off someone's build.
Possible Solutions
I have two ideas:
1. Change what `SIGTERM` does, so Horizon shuts down gracefully. This would work for any process monitor.
2. The `horizon:terminate` command can include a `--hang` (or similar) flag to wait until workers are shut down (instead of being async and returning immediately). This would allow Systemd to use `ExecStop`.

I'm willing to help with a PR for any solution, but I'll need some direction on where to find some code due to Horizon's complexity. I'm guessing this may involve some polling to figure out the state of each worker, but I'm not 100% clear on that.
A Hacky Solution
If the above ideas are too complex (I can certainly see that being the case), I have a hacky solution that works in Systemd (there's no equivalent in Supervisord).

File (in Ubuntu 18.04): `/lib/systemd/system/myjobs.service`.

Use the following to see it working:
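(The original unit file and demo commands did not survive in this copy of the thread. A minimal sketch of what the `ExecStop` workaround described above could look like follows; the unit name `myjobs` and the Ubuntu 18.04 path come from the post, while the application paths, user, and the `TimeoutStopSec` value are assumptions.)

```ini
# /lib/systemd/system/myjobs.service  (sketch; paths/user are assumptions)
[Unit]
Description=Horizon queue master

[Service]
User=www-data
Restart=always
ExecStart=/usr/bin/php /var/www/app/artisan horizon
; On `systemctl stop myjobs`, run terminate and wait for workers to
; drain before systemd follows up with SIGTERM/SIGKILL.
ExecStop=/usr/bin/php /var/www/app/artisan horizon:terminate --wait
TimeoutStopSec=3600

[Install]
WantedBy=multi-user.target
```

With something like this in place, `systemctl stop myjobs` should block until the terminate command returns, rather than killing Horizon mid-job.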
Let me know...
If you have other ideas or think I'm misunderstanding something, definitely let me know!