laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon

Horizon not detecting dead supervisor process #599

Closed · szokeptr closed this issue 5 years ago

szokeptr commented 5 years ago

Description:

It looks like Horizon does not detect when supervisor processes are killed, and this results in no worker processes running.

I am not sure if this is intended behavior or a bug, but it is really an issue for me. For example, supervisor processes can die when the OS has a low maximum file descriptor limit, which is obviously a separate issue, but it would be great if Horizon could handle this case too.

I am now considering switching back to regular Laravel workers, as this is not an issue when supervisord supervises the workers directly.

Steps To Reproduce:

  1. Start horizon: php artisan horizon
  2. Get process ID of the supervisor process: ps aux | grep horizon:supervisor
  3. Kill process with any signal: kill PID
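
For reference, the steps above can be scripted roughly like this. This is only a sketch: it assumes a single Horizon instance on the machine and that pgrep/pkill are available.

# Sketch of the repro above (assumes a single Horizon instance and that pgrep/pkill exist)
php artisan horizon &               # 1. start Horizon in the background
sleep 5                             # give it time to spawn the supervisor and workers
pgrep -f 'horizon:supervisor'       # 2. print the supervisor PID
pkill -f 'horizon:supervisor'       # 3. kill it (SIGTERM by default)
sleep 5
pgrep -f 'horizon:work' || echo "no workers left; the supervisor was not restarted"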
driesvints commented 5 years ago

Hey there,

Unfortunately we don't support this version anymore. Please check out our support policy on which versions we are currently supporting. Can you please try to upgrade to the latest version and see if your problem persists? We'll help you out and re-open this issue if so.

Thanks!

szokeptr commented 5 years ago

@driesvints I can confirm this issue exists on Laravel v5.8.18 and Horizon v3.2.1 (on a fresh installation):

Peter-MacBook-Pro:~ szokeptr$ ps | grep horizon
 9488 ttys004    0:00.00 grep horizon
 9403 ttys005    0:00.18 php artisan horizon
 9407 ttys005    0:00.17 /usr/local/Cellar/php/7.3.4/bin/php artisan horizon:supervisor peter-macbook-prolocal-4bIg:supervisor-1 redis --delay=0 --memory=128 --queue=default --sleep=3 --timeout=60 --tries=3 --balance=simple --max-processes=3 --min-processes=1 --nice=0
 9424 ttys005    0:00.16 /usr/local/Cellar/php/7.3.4/bin/php artisan horizon:work redis --delay=0 --memory=128 --queue=default --sleep=3 --timeout=60 --tries=3 --supervisor=peter-macbook-prolocal-4bIg:supervisor-1
 9425 ttys005    0:00.16 /usr/local/Cellar/php/7.3.4/bin/php artisan horizon:work redis --delay=0 --memory=128 --queue=default --sleep=3 --timeout=60 --tries=3 --supervisor=peter-macbook-prolocal-4bIg:supervisor-1
 9426 ttys005    0:00.15 /usr/local/Cellar/php/7.3.4/bin/php artisan horizon:work redis --delay=0 --memory=128 --queue=default --sleep=3 --timeout=60 --tries=3 --supervisor=peter-macbook-prolocal-4bIg:supervisor-1
Peter-MacBook-Pro:~ szokeptr$ kill 9407
Peter-MacBook-Pro:~ szokeptr$ ps | grep horizon
 9550 ttys004    0:00.00 grep horizon
 9403 ttys005    0:00.19 php artisan horizon
Peter-MacBook-Pro:~ szokeptr$
driesvints commented 5 years ago

AFAIK Horizon can't do this, so I'll mark this as a feature request.

szokeptr commented 5 years ago

If I kill a worker process in the same way, a new one is started, so I think Horizon has the ability to do this. If I figure it out I'll be happy to send a PR.

driesvints commented 5 years ago

It seems that we intentionally don't restart for the exit codes 0, 2 and 13: https://github.com/laravel/horizon/blob/a9204280f72a1c6d3874e0f98b68e354b212311a/src/SupervisorProcess.php#L38-L42

But I don't know why tbh.

driesvints commented 5 years ago

It seems that 0 & 2 are exit codes that Supervisor on Linux doesn't restart for, so it's intentional for us to ignore these. Which exit code is returned when you kill the process?
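
For what it's worth, a process that is killed by a signal doesn't return a normal exit code at all; a shell reports its status as 128 + the signal number, so a plain kill (SIGTERM, 15) shows up as 143 and SIGKILL (9) as 137. A quick way to see this from a shell (what Horizon's process handling actually receives may differ, so treat this only as a rough indication):

# Observe the exit status a shell reports for a signal-killed PHP process
php -r 'sleep(60);' &     # stand-in for the supervisor process
kill $!                   # default signal is SIGTERM (15)
wait $!
echo $?                   # prints 143 (128 + 15); kill -9 would give 137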

szokeptr commented 5 years ago

It’s hard to reproduce, but I will try to put together a minimal setup to demonstrate and check the exit code.

lonnylot commented 5 years ago

@szokeptr Are you using Supervisord per the docs? (https://laravel.com/docs/5.8/horizon#deploying-horizon)

szokeptr commented 5 years ago

@lonnylot yes, of course. The problem is that Horizon ignores the death of the supervisor process (which it launched itself), and because of this no worker processes remain active for that supervisor. This is an extremely rare edge case, as the supervisor process normally has no reason to crash; in my case the root cause was that the system hit the max open files limit and PHP was not able to start properly.
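
As an aside, for anyone who wants to rule this cause in or out, the relevant limits can be checked like this on Linux (the pgrep pattern is only an approximation of the master process's command line):

# Check descriptor limits and usage on Linux (commands/paths may vary by distro)
ulimit -n                                              # per-process limit for the current shell
cat /proc/sys/fs/file-max                              # system-wide limit
MASTER_PID=$(pgrep -f 'artisan horizon$' | head -n1)   # approximate match for the master process
ls /proc/"$MASTER_PID"/fd | wc -l                      # descriptors currently in use by it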

I still think this is a valid issue and something should be done about it, so I will definitely investigate a bit more as soon as I have time. I will close this until then.

amravazzi commented 4 years ago

+1 here. I get the message "A supervisor with this name is already running", then "INFO reaped unknown pid [$pid]", and then the Horizon supervisors are no longer shown in the Horizon dashboard.

fabriciojs commented 3 years ago

Deploying to a Kubernetes cluster, I am facing this issue. I could not yet determine what exactly is killing the supervisor, but it eventually dies and won't return. We have other supervisors up and running in the container.

It may be related to resource limitations, but we are still trying to pinpoint that.

amravazzi commented 3 years ago

After debugging, researching and suffering, I gave up on Horizon in this version of Laravel (mainly because the maintainers' response to this and other bugs is, sic, "we no longer support this version of Laravel") and used a combination of supervisord processes and php artisan queue:listen. This solved all my problems with memory leaks and allocation.

Here is my horizon.conf supervisord configuration file:

[supervisord]
nodaemon=true

[program:horizon]
process_name=%(program_name)s_%(process_num)02d
command=php /srv/app/artisan queue:listen --queue=<queues name>
autostart=true
autorestart=true
redirect_stderr=true
stdout_logfile=/dev/fd/1
stdout_logfile_maxbytes=0
killasgroup=true
stopasgroup=true
numprocs=4
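
A small usage note on the config above: with nodaemon=true it is meant to run in the foreground, for example as a container entrypoint, and numprocs=4 together with the process_name template starts four separately named queue:listen workers. Something along these lines should work (the path is just an illustration):

# Run supervisord in the foreground with this config (example path only)
supervisord -c /srv/app/horizon.conf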
fabriciojs commented 3 years ago

I identified that when the Horizon supervisor process was killed due to PHP's memory limit, right afterwards Horizon would say that the supervisor is already running. That is likely a race condition in how it tells whether the supervisor is actually running or not.
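
As a possible mitigation while that is being sorted out (a workaround sketch, not a fix for the underlying race): Horizon ships a purge command that terminates rogue Horizon processes on the host; whether it also clears the stale "already running" state I can't say for certain, but it may be worth trying before restarting:

# Workaround sketch, not a fix for the race itself: Horizon's purge command
# terminates rogue/orphaned Horizon processes on this host
php artisan horizon:purge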

amravazzi commented 3 years ago

@fabriciojs indeed. A related problem is that Horizon workers may sometimes become zombie processes, consuming a lot of memory, which can lead your machine to run out of memory.

Namoshek commented 3 years ago

It is also common that containers are killed by the OOM killer when they consume too much memory. In the case of Kubernetes, only the container is restarted while the pod remains the same, which means the hostname also remains the same. This leads to the mentioned race condition.
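
For anyone chasing the same thing on Kubernetes, the OOM kill is usually visible in the pod's last container state. The <namespace> and <pod> values below are placeholders, and the field is empty if the container has never been restarted:

# Placeholders: replace <namespace> and <pod> with your Horizon pod's values
kubectl -n <namespace> get pod <pod> \
  -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
# prints OOMKilled when the previous container instance exceeded its memory limit
kubectl -n <namespace> describe pod <pod> | grep -A 3 'Last State'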