laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon
MIT License

Horizon Orphan Process #178

Closed dbpolito closed 6 years ago

dbpolito commented 6 years ago

I'm running Horizon on the latest Forge machine as documented... the daemon runs php artisan horizon, and on deployments I run php artisan horizon:terminate, but from time to time I need to manually run php artisan horizon:purge.

This is the output I just got, hours after the last release:

$ php artisan horizon:purge
Observed Orphan: 17043
Observed Orphan: 30634
Observed Orphan: 31084
Observed Orphan: 31085

I can confirm these are orphans by running htop in tree mode (press F5): I see these processes as root-level processes, not inside the master php artisan horizon process.


Also, every time I run purge, it ALWAYS wrongly reports 1 process as an orphan:

$ php artisan horizon:purge
Observed Orphan: 17059

I haven't found a pattern yet for why this is happening.

agustinprod commented 6 years ago

I'm having the same issue. I woke up to my main server using all the RAM and 4 GB of swap. Running purge killed 78 processes. I'm running Horizon on Ubuntu 16.04 with PHP 7.1.9-1.

themsaid commented 6 years ago

Hello everyone,

Ideally you shouldn't have any rogue processes when running Horizon, but for exceptions like the ones you shared we built the purge command for you to run on a regular basis if your server has many rogue processes. This is a safe approach until we're able to investigate the issue more.

So I recommend that you schedule the command to run every day.
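
For example, a minimal scheduling sketch (standard Laravel scheduler API; adjust to your own app/Console/Kernel.php):

<?php
// app/Console/Kernel.php (sketch)

namespace App\Console;

use Illuminate\Console\Scheduling\Schedule;
use Illuminate\Foundation\Console\Kernel as ConsoleKernel;

class Kernel extends ConsoleKernel
{
    protected function schedule(Schedule $schedule)
    {
        // Kill any rogue Horizon worker processes once a day.
        $schedule->command('horizon:purge')->daily();
    }
}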

On the other hand, can you please share some information about your setup? I'm trying to understand what might cause rogue processes.

dillingham commented 6 years ago

Forge Daemon command: php artisan horizon
Directory: /home/forge/sitename.com/current

In Envoyer, when I just added php artisan horizon:purge to my deployment hook, it purged 20+ processes. I began investigating when I noticed old code being used even though I run horizon:terminate.

agustinprod commented 6 years ago

Our Envoy deploy script uses horizon:terminate to kill the processes; the high number of orphan processes may be related to this. We added horizon:purge to the deployment script too. @themsaid if you find it difficult to reproduce this bug, we could arrange SSH access to my VPS.

jkudish commented 6 years ago

Sorry to hijack the thread, but I'm seeing weird behaviour with Horizon running old code after I've deployed fresh code. Wondering if it's related to the orphans.

I've got my stuff set up through Envoyer/Forge.

I've set up the following hook in Envoyer. I currently have it running after Purge Old Releases -- is that where others set it up too, or does it make more sense to have it at a different spot?

(screenshot: the Envoyer deployment hook)

For now, as a workaround, I have been manually restarting the daemon in Forge whenever I deploy, but this is annoying and defeats the purpose of having Envoyer in the first place:

(screenshot: restarting the daemon in Forge)

I also notice that if I SSH into the server and run php artisan horizon:purge it always finds at least 1 orphan process, but sometimes up to a dozen.

Any ideas @themsaid ?

taylorotwell commented 6 years ago

It's hard to track down why a process would go orphaned. You can technically run horizon:purge after every deploy if you want to.

jkudish commented 6 years ago

Hi @taylorotwell :)

I do already, as per the screenshot above, but they still seem to go orphaned all the time. I don't really know what that means or if it's a real issue, though. But Horizon not running my latest code after a deploy is more of a problem.

dbpolito commented 6 years ago

Well, this basically happens when you run php artisan horizon:terminate while you have running jobs...

After one deploy:

$ php artisan horizon:purge
Observed Orphan: 548
Observed Orphan: 1879

$ php artisan horizon:purge
Observed Orphan: 548
Observed Orphan: 1898

$ php artisan horizon:purge
Observed Orphan: 1941

The interesting thing is that it displayed 548 twice.

After another deploy:

$ php artisan horizon:purge
Observed Orphan: 1857
Observed Orphan: 1862
Observed Orphan: 2301
$ php artisan horizon:purge
Observed Orphan: 2320

My job takes an average of 30 seconds, so maybe the problem is how long terminate waits for the jobs to complete before killing them...

peterlupu commented 6 years ago

@jkudish I'm experiencing similar things on my server. When I deploy, at the end, I run php artisan horizon:terminate. Sometimes when running terminate, Sending TERM Signal To Process: **** is not displayed, even when run manually from the console.

I've checked this and something seems off:

  1. Made a Job (no params in constructor, so they're not serialized) which checks an array from a class for a number.
  2. If it cannot find said number, I throw an Exception.
  3. I deploy the code with an array of [1,2,3] and run the job checking for the number 5.
  4. It fails with the above Exception.
  5. I deploy my code again, adding 5 to the array, so the job would stop failing.
  6. But that doesn't happen; I end up having to run php artisan horizon:terminate a few more times before it picks up the change. Not sure if it's time-related.
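
For reference, a minimal sketch of the kind of test job described above (class name and values are illustrative, not the actual code):

<?php
// app/Jobs/CheckNumberJob.php (illustrative sketch)

namespace App\Jobs;

use Exception;
use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;

class CheckNumberJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable;

    // No constructor parameters, so nothing is serialized with the job;
    // what happens depends only on the code the worker currently has loaded.
    public function handle()
    {
        $allowed = [1, 2, 3]; // deploy a change that adds 5 here

        if (! in_array(5, $allowed)) {
            throw new Exception('Number 5 not found: worker is running old code.');
        }
    }
}
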
mfn commented 6 years ago

Hello everyone 😄

First off: great product!

However, we deployed Horizon yesterday to production and it didn't take long until we realized we had orphaned processes :-/

Regarding the "always wrongly 1 orphan"

@dbpolito (et al.)

Also, every time I run purge, it ALWAYS wrongly reports 1 process as an orphan:

I can reproduce this. I'm always shown a PID that is untraceable as to which process it belongs to. I did some debugging (e.g. using pgrep -f … -a or even ps auxw snapshotting) but never "saw" a process with this particular ID.

I also used https://github.com/a2o/snoopy to log which processes are spawned, and this particular PID never showed up. E.g. in one case I was reported an orphaned PID 24575, but that specific PID wasn't tracked; other invocations were:

Nov  7 10:23:05 dev snoopy[24568]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -P 21632
Nov  7 10:23:05 dev snoopy[24570]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -P 21631
Nov  7 10:23:05 dev snoopy[24572]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon -a
Nov  7 10:23:05 dev snoopy[24574]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon:purge -a
Nov  7 10:23:05 dev snoopy[24576]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon
Nov  7 10:23:05 dev snoopy[24578]: [uid:1000 sid:11739 tty:/dev/pts/4 cwd:/vagrant/ filename:/usr/bin/pgrep]: pgrep -f horizon:purge

I don't think it's a coincidence that all pgrep invocations skip a PID before the next one is executed; it's as if a PID is being consumed internally.

This PID is received from the call to:

$this->exec->run('pgrep -f horizon')

as part of \Laravel\Horizon\ProcessInspector::current.

But even when I run var_dump($this->exec->run('pgrep -f horizon -a')); I get a different PID in the end. I.e. two consecutive runs of pgrep -f horizon -a and pgrep -f horizon produce a different last PID in the result, which ends up being "the one orphan" I always see.

OTOH I noticed that due to using pgrep -f horizon, the killing of processes is very greedy. I'm using supervisord and have configured my Horizon master process to log to a file named horizon.log. I was tailing this file with tail -f horizon.log, and ./artisan horizon:purge will kill that command as well.

So depending on what other processes you have running on the system that have horizon in their name/arguments, this may be dangerous and you can experience random process killing. I guess the chances are slim, but it never hurts to know this.
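
As a minimal illustration of how greedy that matching is (a sketch, not Horizon's actual code):

<?php
// Anything whose full command line contains "horizon" is returned,
// e.g. a `tail -f horizon.log` session, not only Horizon processes.
$output = [];
exec('pgrep -f horizon', $output);

// One PID per line for every matching process.
var_dump($output);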

Our orphaned process problem

Yesterday we realized we had two orphaned processes. Their characteristics:

We're running Ubuntu 14.04 LTS with supervisord 3.0b2-1. Here's what a healthy process tree looks like:

init(1)-+-acpid(1230)
        |-supervisord(1398)-+-php(1613)-+-php7.1(1771)-+-php7.1(1915)
        |                   |           |              `-php7.1(18509)
        |                   |           |-php7.1(1772)-+-php7.1(1914)
        |                   |           |              `-php7.1(1918)
        |                   |           |-php7.1(1773)---php7.1(1912)
        |                   |           |-php7.1(1774)---php7.1(1930)
        |                   |           |-php7.1(1775)-+-php7.1(1913)
        |                   |           |              `-php7.1(1917)
        |                   |           |-php7.1(1776)-+-php7.1(1916)
        |                   |           |              `-php7.1(1920)
        |                   |           |-php7.1(1777)---php7.1(1921)
        |                   |           |-php7.1(1778)-+-php7.1(1923)
        |                   |           |              |-php7.1(1924)
        |                   |           |              |-php7.1(1926)
        |                   |           |              `-php7.1(1927)
        |                   |           `-php7.1(1779)---php7.1(1928)
        |                   |-php(28235)
        |                   |-php(28236)
        |                   |-php(28269)
        |                   `-php(28409)

Here is how the rogue processes appeared (I removed the surroundings):

init(1)-+-acpid(1230)
        |-php(10430)
        |-php(10494)

I know these were the correct processes because with ps auxw I saw their arguments matching the information I described above.

When does re-parenting happen? E.g. https://unix.stackexchange.com/a/152400/7924

You can not start a process as the child of the shell, and then "reparent" it so another process becomes it's parent. init with PID 1 is an exception, processes can become it's child as it collects processes that lost their original parent process.
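
As an illustration of that re-parenting behaviour, a minimal sketch (assumes the pcntl and posix extensions, CLI only):

<?php
// When a parent exits before its child, the child is re-parented to PID 1.
$pid = pcntl_fork();

if ($pid === 0) {
    // Child: wait until the parent has exited, then report our new parent.
    sleep(2);
    echo 'child is now owned by PID ' . posix_getppid() . PHP_EOL; // typically 1 (init)
    exit(0);
}

// Parent exits immediately, orphaning the child.
exit(0);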

So why can/does this happen in case of supervisor/horizon/etc.?

Some guesses (?):

Lots of guessing, as you can see. For now we will also add the purge command, but seeing that there's something off, it's clear a code fix would be desirable. I can assist with more information/debugging once I get new orphaned processes.

PS: we're also using https://github.com/wa0x6e/Cake-Resque to drive background jobs from a CakePHP project. It has its own set of problems, but we never experienced this kind of orphaning. Or at least not at this rate.

ghobaty commented 6 years ago

We are seeing the same weird issue where Horizon keeps running old code, even after using php artisan horizon:terminate and php artisan horizon:purge (using a Forge daemon). Any ideas what is going wrong and how to fix this? @jkudish did you find any working workarounds?

peterlupu commented 6 years ago

@elghobaty Try also running php artisan queue:restart. Also, see #213.

jkudish commented 6 years ago

@elghobaty I have my deploy restart php-fpm and then run php artisan horizon:terminate and php artisan horizon:purge, and I haven't had any issues lately.

Slickspacestech commented 6 years ago

I've noticed the same thing in our staging environment: orphaned Horizon workers, stale code, etc.

I find that using horizon:terminate while supervisor is running seems to cause it, whereas if I stop the Horizon supervisor, then pull, then start it again, it's okay.

barryvdh commented 6 years ago

Should running 'purge' still let the workers finish their active jobs?

I'm also seeing a lot of orphaned processes; running purge kills them cleanly. I'm using Deployer (so different release directories, symlinking the active one to 'current') and using Forge to manage the daemon.

royduin commented 6 years ago

Same issue here! And good question @barryvdh: the processes should finish their jobs; do they when you use php artisan horizon:purge?

This is Taylor's article about deployments with Horizon on Forge: https://medium.com/@taylorotwell/deploying-horizon-to-laravel-forge-fc9e01b74d84. According to it, php artisan horizon:terminate should be enough...

bbashy commented 6 years ago

I don't get this command

$ while true; do
while> php artisan horizon:purge
while> done
Observed Orphan: 28070
Observed Orphan: 28080
Observed Orphan: 28090
Observed Orphan: 28100
Observed Orphan: 28110
Observed Orphan: 28120
Observed Orphan: 28130
Observed Orphan: 28140
Observed Orphan: 28150
Observed Orphan: 28160
Observed Orphan: 28170
Observed Orphan: 28180
Observed Orphan: 28190
Observed Orphan: 28200
Observed Orphan: 28210
Observed Orphan: 28220
Observed Orphan: 28230
Observed Orphan: 28240
Observed Orphan: 28250
Observed Orphan: 28260
Observed Orphan: 28270
Observed Orphan: 28280
Observed Orphan: 28290
Observed Orphan: 28300
Observed Orphan: 28310
Observed Orphan: 28320
Observed Orphan: 28330
Observed Orphan: 28344
Observed Orphan: 28354
Observed Orphan: 28364
Observed Orphan: 28374
Observed Orphan: 28384
^C%
marianvlad commented 6 years ago

I have the same problem. I'm using a daemon, and every time I deploy I need to kill processes manually with pkill -f horizon, because deleting the daemon still keeps some processes up with the old code. I'm using Horizon on another server and I don't need to do that there. It looks like something is wrong with Ubuntu or with the PHP function for killing processes, because artisan horizon:terminate doesn't work every time.

taylorotwell commented 6 years ago

Yeah, I would like to figure out a root cause, or maybe an easy recreation of the issue, so we can track it down. I don't have a good understanding of why it is happening, since all horizon:work processes would be started from a supervisor and it's unclear to me how the parent supervisor could die without also killing all of its child processes.

taylorotwell commented 6 years ago

Hi all,

We have tagged 1.2.0 with a slightly different approach to terminating processes. Can some of you try it and report if you see any improvements?

marianvlad commented 6 years ago

I will try right now.

taylorotwell commented 6 years ago

@marianvlad thanks

bbashy commented 6 years ago

Trying now

🎉

Updating laravel/horizon (v1.1.1 => v1.2.0): Downloading (100%)

taylorotwell commented 6 years ago

If this doesn't fix it I'll probably be putting a nice little bounty on this :smile:

themsaid commented 6 years ago

Just a notice: the horizon:purge command doesn't work as expected, so if you run it and get a single rogue process, ignore it; it's not an indicator. Only if you get multiple processes does it mean there are actual orphans.

marianvlad commented 6 years ago

I ran horizon:terminate 3 times and then I deleted the daemon. In htop I have these: https://i.imgur.com/t9RFWuE.png

themsaid commented 6 years ago

@marianvlad these could be from before. Run the horizon:purge command, start the daemon, then monitor from there.

barryvdh commented 6 years ago

Can't we just send the queue:restart signal from horizon:terminate? That seems to work reliably.
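
For context, queue:restart doesn't send a UNIX signal at all; it stores a timestamp in the cache (the framework uses an illuminate:queue:restart cache key) and workers exit when they see it change. A rough sketch of that mechanism (illustrative, not Horizon's internals):

<?php

use Illuminate\Support\Facades\Cache;

// Deploy / terminate side: record when a restart was requested.
Cache::forever('illuminate:queue:restart', time());

// Worker side: remember the timestamp seen at boot...
$bootedAt = Cache::get('illuminate:queue:restart');

// ...and between jobs, exit when it changes so the process supervisor
// can start a fresh worker running the new code.
while (true) {
    sleep(1); // placeholder for processing the next job

    if (Cache::get('illuminate:queue:restart') !== $bootedAt) {
        exit(0);
    }
}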

taylorotwell commented 6 years ago

@barryvdh that could be an option. To be honest, I've forgotten why it didn't work like that in the first place but Horizon changed a bit while I was in the process of writing it.

We can wait until we get a bit more feedback on the 1.2.0 termination process and then consider other options if we're not seeing improvement.

taylorotwell commented 6 years ago

@marianvlad did you make sure you started with zero horizon processes? Any more feedback?

marianvlad commented 6 years ago

@taylorotwell I think the PID changes after horizon:terminate and that is the reason for the remaining processes after deleting the daemon in Forge. I hope the remaining processes are not "real processes" running old code jobs. I will change something in my jobs and give feedback.

mfn commented 6 years ago

So far I have not been able to reproduce this, except by observing it in production.

Haven't yet upgraded to the latest Horizon unfortunately; the earliest possibility will be sometime next week or later.

However, out of curiosity to everyone affected:

I observed that all the orphaned workers I've had so far were only from two queues, which may have out-of-the-ordinary long SQL runtimes.

Slickspacestech commented 6 years ago

Running Postgres, seen it with 9.6 and 10 (although I've seen it with MySQL as well). All fast queries. I've seen it happen where the workers are all idle and I pull changes in staging. My jobs are all very simple; some trigger web requests, which would take longer, and some are just database inserts for small amounts of data.

I've tried using horizon:terminate, and using supervisor to stop Horizon.

I've noticed that in some situations I need to sudo purge vs just purge, but I am pretty certain that is dependent on the actual permissions my user has over the Laravel directory.

Honestly, if you just document it really well, it's not a big deal to purge. It is, however, a total mind f*** when you are making small incremental changes to tweak something and the behavior is completely unexpected, or it throws an error pointing at a line in a class that no longer exists, or hasn't for weeks...

taylorotwell commented 6 years ago

Hey all,

Just wanted to note we have made some further tweaks in the latest 1.2.2 release, including using the queue:restart cache logic approach, in addition to a couple of other fixes.

Please let us know how this release works for you. We rely on your feedback :heart:

marianvlad commented 6 years ago

I did multiple tests with 1.2.0 on a clean project and everything worked perfectly, except in my real project.

My setup looks like this:

config/queue.php

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => 'default',
    'retry_after' => 7200, // because my jobs take time
    'block_for' => null, // without this thing, I use 5.5
],

config/horizon.php

'production' => [
    'supervisor-1' => [
        'connection' => 'redis',
        'queue' => ['default'],
        'balance' => false,
        'processes' => 3,
        'tries' => 1,
    ],
    'supervisor-2' => [
        'connection' => 'redis',
        'queue' => ['another-test'],
        'balance' => false,
        'processes' => 3,
        'tries' => 1,
    ],
    'supervisor-3' => [
        'connection' => 'redis',
        'queue' => ['test'],
        'balance' => false,
        'processes' => 5,
        'tries' => 1,
    ],
],

A sample of my job:

<?php

namespace App\Jobs;

use Bunch\Of\Classes; //

class PostJob implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $timeout = 3600;

    protected $domain;
    protected $video;
    protected $temp;

    /**
     * Create a new job instance.
     *
     * @return void
     */
    public function __construct(Domain $domain, Video $video)
    {
        $this->domain = $domain;
        $this->video = $video;

        // this temp is not visible inside the failed() method if I set it in the handle() method
        $this->temp = $this->tempPath();
    }

    /**
     * Execute the job.
     *
     * @return void
     */
    public function handle()
    {
        if (! $condition) {
            // some checks
            $this->cleanTemp(); // clean directory
            return;
        }

        $melody = resolve(Melody::class)
            ->setDomain($this->domain->endpoint)
            ->setKey('key-blah-blah');

        // Check if domain is up before
        if ($melody->ping()) {
            $storage = $this->domain->storages->random();

            $twig = resolve('twig');

            $template = $this->domain->template();
            $thumbnail = $this->video->thumbnail_url;

            if (! $thumbnail && ! urlHealth($thumbnail)) {
                $this->thumbnail();

                $thumbnail = asset("storage/temp/".basename($this->temp)."/thumbnail.jpg");
            }

            // Create video preview
            $preview = (new PreviewService($this->domain, $this->video, $this->temp))->handle();

            $twigData = FormatVideo::twigAttributes($this->video);

            // Upload the video
            $upload = (new UploadService(
                $storage,
                $this->video->videoPath(),
                str_slug($twigData['username']) . '_' . str_slug($this->video->title . ' ' . $twigData['date']).'.'.last(explode('.', $this->video->videoPath()))
            ))->handle();

            if (! $upload) {
                $this->cleanTemp();
                return;
            }

            $storage->increaseUpload($this->video->size);

            $title = $twig->createTemplate($template['title'])->render($twigData);
            $description = str_limit($twig->createTemplate($template['description'])->render($twigData), 3000);
            $tags = FormatVideo::getTagsByType($this->video, $template, $twigData);

            $melody = $melody
                ->uploadVideo($title, $description, [
                    'media_type' => 'local',
                    'thumbnail'  => str_replace(' ', '+', $thumbnail),
                    'mp4'        => $preview,
                    'length'     => gmdate('H:i:s', $this->video->duration),
                    'category'   => $this->category(),
                    'tags'       => $tags,
                    'metas' => [
                        ['key' => 'source_url', 'value' => $this->video->source_url],
                        ['key' => 'download_link', 'value' => $upload[0]]
                    ]
                ]);

            if (array_has($melody, 'type') && $melody['type'] == 'success') {
                $this->cleanTemp();

                $this->video->finishPosting($this->domain, [
                    'source_id' => $melody['id'],
                    'url' => $melody['url'],
                    'title' => $title,
                    'upload' => $upload[0],
                    'storage_id' => $storage->id
                ]);

                $this->domain->increment('total_posted');
            } else {
                // Delete uploaded file
                (new DeleteUploadedService($storage, $upload[1]))->handle();
                $this->cleanTemp();
            }
        } else {
            $this->video->resetPosted($this->domain);
            $this->cleanTemp();
        }
    }

    public function failed(Exception $exception)
    {
        if ($exception instanceof ServerException) {
            $this->video->resetPosted($this->domain);
        }

        if ($exception instanceof ProcessFailedException) {
            $output = $exception->getProcess()->getErrorOutput();

            if (str_contains($output, 'Retry limit reached')) {
                if (! str_contains($output, "Unexpected state: 'processing'")) {
                    $this->video->resetPosted($this->domain);
                }
            }
        }

        $this->cleanTemp();
    }

    protected function cleanTemp()
    {
        app('files')->deleteDirectory($this->temp);
    }
}

DeleteUploadedService and UploadService are just classes using Symfony\Process to call a Docker container.
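
For readers unfamiliar with that pattern, a rough sketch of such a wrapper (illustrative class and command, not the actual services):

<?php

namespace App\Services;

use Symfony\Component\Process\Exception\ProcessFailedException;
use Symfony\Component\Process\Process;

class UploadService
{
    public function handle()
    {
        // Run a Docker container to do the heavy lifting.
        $process = new Process('docker run --rm example/uploader');
        $process->setTimeout(3600);

        // mustRun() throws ProcessFailedException on a non-zero exit code,
        // which is what the job's failed() method above checks for.
        $process->mustRun();

        return $process->getOutput();
    }
}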

themsaid commented 6 years ago

@marianvlad please upgrade to 1.2.2 and then report your findings.

ghost commented 6 years ago

Our team just tried 1.2.2 and we're still having the same issue with the workers being restarted; however, we observe differences between this and running normally via a terminal window.

dragosperca commented 6 years ago

If killing rogue/child processes sometimes works and sometimes doesn't, a solution would be to introduce a (worker) self-kill option:

This would introduce an extra lookup in Redis for each loop, but the cost is a small one to pay for not having rogue processes.

Also to consider: when the main Horizon process is stopped but rogue processes continue to exist. I suppose an extra check here would be a heartbeat issued by the supervisor. If this heartbeat is older than a few seconds, workers can safely die.
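
A rough sketch of that heartbeat idea (illustrative key name, using Laravel's Redis facade; not Horizon's actual code):

<?php

use Illuminate\Support\Facades\Redis;

// Supervisor (master) side: refresh a short-lived heartbeat key on every
// loop iteration.
Redis::setex('example:horizon:heartbeat', 10, time());

// Worker side: if the heartbeat key has expired, the master process is gone
// and the worker can exit instead of lingering as an orphan (supervisord or
// the Horizon master would normally restart workers that exit).
if (! Redis::exists('example:horizon:heartbeat')) {
    exit(0);
}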

barryvdh commented 6 years ago

Isn't that exactly what queue:restart did, and what is now done with the latest commits: https://github.com/laravel/horizon/blob/62c60efcb5bcff72562a0b918b18034f5f75fe55/src/Console/TerminateCommand.php#L48

dragosperca commented 6 years ago

I should have checked the last commits, but I believe there's a small difference. This cache check is implemented in the "queue:restart" command, and if this one is not executed, rogue processes will not die.

My thinking was to make this a bit more invasive, executed within the main loop cycle. It would link a worker process to a known token, and workers would die automatically if their parent no longer exists or the token changed.

taylorotwell commented 6 years ago

@TruckersMP-Kat we would need way more information to diagnose anything.

taylorotwell commented 6 years ago

@dragosperca are you currently using 1.2.2?

dragosperca commented 6 years ago

I am using Horizon on one project, and thought about using it for another, larger, mission-critical one. I don't experience the current issue; I was just reading all the issues and trying to understand failure points. My comment was just an idea to avoid rogue processes.

Before Horizon, we were using the standard worker & supervisor setup and did a similar thing to what you implemented in 1.2.2, that is, a marker that, when set, would make workers die. Since supervisor was there to restart them if any process died, we had a "self-kill and revive" process. We implemented it in the main (worker) loop, though.

This marker would be set when new code would be deployed to the server.

Anyway, just want to say thank you for the great work you've been doing with Laravel and the ecosystem around it. We all love it!

taylorotwell commented 6 years ago

OK Thanks. We're still looking for any feedback from people running Horizon 1.2.2.

barryvdh commented 6 years ago

On my test server this seems to be working correctly. I'll hopefully deploy it to staging/production tomorrow and see how it turns out.

Does it matter from which version the terminate command is run, in a zero-downtime environment à la Envoyer (so different release directories)? I assume it's safe to just run it from the next release dir (about to become current), just before symlinking?

marianvlad commented 6 years ago

Is anyone having problems using the failed() method inside a job to do things like updating something in the database?

themsaid commented 6 years ago

@barryvdh I don't think it makes a difference.

@marianvlad different topic, different issue please :) This issue is already large and it'd be easier if we just keep it focused on one matter.

taylorotwell commented 6 years ago

Anyone? :smile:

mfn commented 6 years ago

Sorry, soon; hope to upgrade on Monday!

We're basically hit by it quite constantly, every other day, but this week we gave 5.6 priority first ;-) (PS: thanks for the new logging infrastructure 👍).

aocneanu commented 6 years ago

Yesterday I updated Horizon to 1.2.2 on five workers that are doing the same kind of queued jobs. A minute ago I ran terminate and purge on all five workers; on three of them I had one orphan process (one on each), and the other two were clean. Before the update I had 5 to 10 orphans daily on every worker.