laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon

Timed out job hangs for 15 or 30 minutes and then runs #833

Closed canvural closed 4 years ago

canvural commented 4 years ago

Description:

We are having strange errors with our Horizon setup. Basically, a job times out, hangs for 15 or 30 minutes, and then runs again.

This seems to happen to any kind of job. For example, if it's a queued mailable, the user gets an email first, and then 15 or 30 minutes later the user gets the same email again.

Here are our config files.

`config/database.php`:

```php
'redis' => [
    'client' => env('REDIS_CLIENT', 'predis'),

    'default' => [
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD', null),
        'port' => env('REDIS_PORT', 6379),
        'database' => 0,
    ],
],
```

`config/queue.php`:

```php
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('DEFAULT_QUEUE_NAME', 'default'),
    'retry_after' => 120, // 2 minutes
    'block_for' => null,
],
```

`config/horizon.php`:

```php
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => env('HORIZON_CONNECTION', 'database'),
            'queue' => [env('DEFAULT_QUEUE_NAME', 'default')],
            'balance' => 'simple',
            'processes' => 10,
            'tries' => 3,
            'timeout' => 90,
        ],
    ],
]
```
Here is how it looks in the Horizon dashboard. This is when the initial job times out; it stays like this in `Recent Jobs` while the retries are running:

![horizon1](https://user-images.githubusercontent.com/1574232/80081135-d379aa80-8552-11ea-8d10-aea79c16a186.png)

After almost half an hour it changes to this:

![horizon2](https://user-images.githubusercontent.com/1574232/80081160-dc6a7c00-8552-11ea-8c4d-aa3bc3cc91e1.png)

It's the same tags, I just blacked out the names.
Here are the logs we are seeing (times are in UTC):

```
[2020-04-22 11:24:59][88] Processing: App\Mail\ReservationInformation
[2020-04-22 11:29:00][88] Failed: App\Mail\ReservationInformation
[2020-04-22 11:29:00][88] Processing: App\Mail\ReservationInformation
[2020-04-22 11:56:21][88] Processed: App\Mail\ReservationInformation
```

**Note**: With Predis we also see some logs like `Error while reading line from the server. [tcp://REDIS_HOST:6379]`, but with PHPRedis there were none.
Here is an example JSON returned from Horizon endpoint /horizon/api/jobs/failed/332 ```JSON { "JSON":{ "id":"332", "connection":"redis", "queue":"newquebe1p", "name":"App\\Jobs\\SomeJob", "status":"failed", "payload":{ "type":"job", "timeout":null, "tags":[ "Some tags" ], "id":"332", "data":{ "command":"....s:5:\"tries\";i:1;s:6:\"\u0000*\u0000job\";N;s:10:\"connection\";N;s:5:\"queue\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}}", "commandName":"App\\Jobs\\SomeJob" }, "displayName":"App\\Jobs\\SomeJob", "timeoutAt":null, "pushedAt":"1587561267.9688", "job":"Illuminate\\Queue\\CallQueuedHandler@call", "maxTries":1, "delay":null, "attempts":2 }, "exception":"Illuminate\\Queue\\MaxAttemptsExceededException: App\\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. in /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:632\nStack trace:\n#0 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(446): Illuminate\\Queue\\Worker->maxAttemptsExceededException(Object(Illuminate\\Queue\\Jobs\\RedisJob))\n#1 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(358): Illuminate\\Queue\\Worker->markJobAsFailedIfAlreadyExceedsMaxAttempts('redis', Object(Illuminate\\Queue\\Jobs\\RedisJob), 1)\n#2 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(314): Illuminate\\Queue\\Worker->process('redis', Object(Illuminate\\Queue\\Jobs\\RedisJob), Object(Illuminate\\Queue\\WorkerOptions))\n#3 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(134): Illuminate\\Queue\\Worker->runJob(Object(Illuminate\\Queue\\Jobs\\RedisJob), 'redis', Object(Illuminate\\Queue\\WorkerOptions))\n#4 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\\Queue\\Worker->daemon('redis', 'floquebe1p', Object(Illuminate\\Queue\\WorkerOptions))\n#5 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\\Queue\\Console\\WorkCommand->runWorker('redis', 'floquebe1p')\n#6 /var/www/html/vendor/laravel/horizon/src/Console/WorkCommand.php(46): Illuminate\\Queue\\Console\\WorkCommand->handle()\n#7 [internal function]: Laravel\\Horizon\\Console\\WorkCommand->handle()\n#8 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array(Array, Array)\n#9 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()\n#10 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\\Container\\Util::unwrapIfClosure(Object(Closure))\n#11 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\\Container\\BoundMethod::callBoundMethod(Object(Illuminate\\Foundation\\Application), Array, Object(Closure))\n#12 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Container.php(590): Illuminate\\Container\\BoundMethod::call(Object(Illuminate\\Foundation\\Application), Array, Array, NULL)\n#13 /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php(173): Illuminate\\Container\\Container->call(Array)\n#14 /var/www/html/vendor/symfony/console/Command/Command.php(255): Illuminate\\Console\\Command->execute(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))\n#15 
/var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php(160): Symfony\\Component\\Console\\Command\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))\n#16 /var/www/html/vendor/symfony/console/Application.php(1011): Illuminate\\Console\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#17 /var/www/html/vendor/symfony/console/Application.php(272): Symfony\\Component\\Console\\Application->doRunCommand(Object(Laravel\\Horizon\\Console\\WorkCommand), Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#18 /var/www/html/vendor/symfony/console/Application.php(148): Symfony\\Component\\Console\\Application->doRun(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#19 /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\\Component\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#20 /var/www/html/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(131): Illuminate\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#21 /var/www/html/artisan(37): Illuminate\\Foundation\\Console\\Kernel->handle(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#22 {main}", "failed_at":"1587562328.2697", "completed_at":"1587562207.4457", "retried_by":[ ], "reserved_at":"1587561389.885", "index":0 }, } ``` This is not the same job as in the screenshots. But I thought it might be still useful. Specially timestamps.

We tried a lot of different combinations to eliminate the problem, and it happened with every combination. So we think it must be something with Horizon.

We tried:

There is only one instance of Horizon running, and no other queue is handled in this Horizon instance.

Steps To Reproduce:

This is the hard part: it happens randomly, and we can't reproduce it locally.

Any information or tips to try are welcome!

mfn commented 4 years ago

This sounds exactly like the issue I opened a month ago: https://github.com/laravel/framework/issues/32103

The TL;DR was an improperly configured retry_after that was too low. It doesn't look like that in your case, but FWIW maybe it helps you debug this.
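For reference, the relationship looks roughly like this (the values just mirror the config shared above; the point is only that `retry_after` must stay above the worker `timeout`, otherwise a still-running job gets released and picked up by a second worker):

```php
// config/queue.php — retry_after is how long a reserved job may run before the
// queue assumes the worker died and releases the job for another attempt.
'redis' => [
    'driver' => 'redis',
    'queue' => env('DEFAULT_QUEUE_NAME', 'default'),
    'retry_after' => 120, // seconds — must be longer than any worker timeout
    'block_for' => null,
],

// config/horizon.php — the supervisor timeout kills a worker that runs too long.
'supervisor-1' => [
    'queue' => [env('DEFAULT_QUEUE_NAME', 'default')],
    'timeout' => 90, // seconds — keep this below retry_after (120 above)
],
```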

mfn commented 4 years ago

One of the things that helped me the most was using the queue events to add very detailed logging (see initializeQueueEvents in my issue description), down to a custom Monolog processor that also logs the process ID. But I already had this in place; I didn't add it specifically to debug this issue.
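A rough sketch of what that event-based logging can look like (the exact fields logged are illustrative; `Queue::before` / `Queue::after` / `Queue::failing` are the standard Laravel hooks, and `getmypid()` stands in for a Monolog processor), e.g. in a service provider's `boot` method:

```php
<?php

namespace App\Providers;

use Illuminate\Queue\Events\JobFailed;
use Illuminate\Queue\Events\JobProcessed;
use Illuminate\Queue\Events\JobProcessing;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Log every state transition together with the worker's process ID,
        // so overlapping attempts by different workers become visible.
        Queue::before(function (JobProcessing $event) {
            Log::info('Queue job processing', [
                'pid' => getmypid(),
                'connection' => $event->connectionName,
                'job' => $event->job->resolveName(),
                'id' => $event->job->getJobId(),
                'attempts' => $event->job->attempts(),
            ]);
        });

        Queue::after(function (JobProcessed $event) {
            Log::info('Queue job processed', [
                'pid' => getmypid(),
                'job' => $event->job->resolveName(),
                'id' => $event->job->getJobId(),
            ]);
        });

        Queue::failing(function (JobFailed $event) {
            Log::error('Queue job failed', [
                'pid' => getmypid(),
                'job' => $event->job->resolveName(),
                'id' => $event->job->getJobId(),
                'exception' => $event->exception->getMessage(),
            ]);
        });
    }
}
```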

canvural commented 4 years ago

@mfn Yes, I saw your issue. But we always had retry_after greater than timeout.

We are also hooking into the Queue::before, Queue::after and Queue::failing events and logging some info, but there was nothing too useful there.

driesvints commented 4 years ago

Hey there,

Unfortunately we don't support this version anymore. Please check out our support policy to see which versions we currently support. Can you please try to upgrade to the latest version and see if your problem persists? We'll help you out and re-open this issue if so.

Thanks!

canvural commented 4 years ago

What exactly don't you support?

Laravel 6 is the LTS version and Horizon 4 only works with Laravel 7, so we can't install it right now.

Also, Horizon 4 does not have any critical fixes to the core logic, so I doubt it fixes the issue.

canvural commented 4 years ago

@driesvints Can you please keep the issue open? Thanks!

driesvints commented 4 years ago

The first-party libraries aren't LTS, only Laravel itself is. If you can reproduce this with Horizon 4 I'll re-open.

canvural commented 4 years ago

https://github.com/laravel/horizon/compare/v3.7.2...v4.3.1

You can see from the change list that there are no critical changes: no bug fixes related to the core logic, just new features.

So ignoring the issue and just suggesting a blind upgrade doesn't really make sense in this situation.

driesvints commented 4 years ago

I can't be sure about that from just looking at the list. From what I can see above, this looks like a configuration issue.

Can you first please try one of the support channels below? If you can actually identify this as a bug, feel free to report back and I'll gladly help you out and re-open this issue.

Thanks!

canvural commented 4 years ago

Upgraded to Horizon 4.3.2 and Laravel 7.9.2, problem still exists.

canvural commented 4 years ago

(screenshot of the Horizon dashboard)

Most of the jobs look like this now. The first attempt "starts" but doesn't actually run (for example, log statements inside the job are not executed); then 2 minutes later (our retry_after value) the second attempt is executed, and this time it runs successfully. But the first attempt still "hangs", and after 30 minutes it is updated as in the screenshot above.

driesvints commented 4 years ago

Did you try any of the support channels from above?

canvural commented 4 years ago

@driesvints Yes, I tried Larachat and Stack Overflow, with no luck.

driesvints commented 4 years ago

Can you share the links?

canvural commented 4 years ago

https://stackoverflow.com/questions/61386640/timed-out-job-hangs-for-15-or-30-minutes-and-then-runs

I also just shared the SO link on Larachat and Twitter.

1212087 commented 4 years ago

Hi @canvural, did you find any solution for this issue? I'm getting the same issue with Laravel 7.6 and PHP 7.4 now. My jobs hang for about 2 hours before retrying, but I think this issue is related to PDO, because the job always gets stuck while executing a query. Jobs that don't use PDO never get stuck. I also opened an issue here (it was moved to a discussion): https://laracasts.com/discuss/channels/laravel/pdo-hangs-on-executing-query-on-heavy-queue-system-timeout-does-not-work

rgembalik commented 2 years ago

@canvural @1212087 Did either of you manage to pinpoint the source of the issue?

I am observing the same on a multi-queue Redis Horizon setup. retry_after is 90s, timeout is 60s. The jobs work fine until one of them hangs the execution for 940-950 seconds (sometimes a bit outside of this range, but it's always more than 15 minutes). After that time the job is completed, with a Runtime of 940-950 seconds. All following jobs are then executed within 0.1s. No errors in any logs.

The last time I checked, Horizon didn't log Processing: NameSpace\ClassName when the problem occurred. Only after those 940-950 seconds did the Processing: and Processed messages appear, even though the job in question had its Runtime marked as over 940 seconds.

The problem also goes away when I restart the Horizon container, but that's not the best solution. While the jobs are frozen, I can still run php artisan queue:monitor and see the number of jobs in the queues, so the connection to Redis is ok.

After a restart, the problem tends to disappear for a couple of hours.

canvural commented 2 years ago

@rgembalik For us this turned out to be a configuration error in our systems. We were using OpenShift and Docker. We adjusted these values in our containers/systems:

`net.ipv4.tcp_keepalive_intvl`
`net.ipv4.tcp_keepalive_probes`
`net.ipv4.tcp_keepalive_time`

and for now everything works normally.

rgembalik commented 2 years ago

@canvural That looks very promising, thanks! Do you remember the values that were there originally, and what values did you end up setting? I'd like to compare them with our current Kubernetes nodes.

canvural commented 2 years ago

@rgembalik I don't remember the old values. But currently we have these:

(screenshot of the current sysctl values)

rgembalik commented 2 years ago

For anyone who has the same problem in the future: it turned out I cannot change these sysctl values, as by default they are considered "unsafe" for a Kubernetes cluster. If anyone else stumbles on this problem without the ability to mess with system settings, you can try my workaround below to detect stale jobs.

I would love it if Horizon had something similar to artisan queue:monitor, but instead of counting jobs it would validate that no jobs have been pending for longer than X seconds. Since there isn't such a thing, I wrote something that does that check for me.

Caution: the solution you are about to see is hacky and dirty. For best results, implement it as a separate artisan command in your code. That wasn't an option for me at the moment, so I used artisan tinker --execute.

The script goes through all of the pending jobs in Horizon, finds the oldest one, and checks whether it is older than 3 * 60 seconds (3 minutes). If it is, the script fails. You can use it as a health check, which will cause e.g. a Horizon container restart.

Liveness (health check) probe script for Horizon... use at your own discretion:

```bash
#!/bin/sh

CHECK_BACKLOG_AGE=$(cat <<-"PHP"
$repo = App::make('Laravel\Horizon\Repositories\RedisJobRepository');
$jobStartTimes = [];

while (count($jobStartTimes) < $repo->countPending()) {
    $jobs = $repo->getPending(count($jobStartTimes))->all();

    if (count($jobs) == 0) {
        break;
    }

    $jobStartTimes = array_merge($jobStartTimes, array_map(function ($job) {
        $date = new \DateTime();
        $date->setTimestamp(json_decode($job->payload)->pushedAt);

        return $date;
    }, $jobs));
}

if (!$jobStartTimes || count($jobStartTimes) == 0) {
    echo "Jobs are ok".PHP_EOL;
    return;
}

arsort($jobStartTimes);

$diff = (new \DateTime())->getTimestamp() - array_pop($jobStartTimes)->getTimestamp();

if ($diff >= 3 * 60) {
    echo "Jobs are too old!".PHP_EOL;
    return;
}

echo "Jobs are ok".PHP_EOL;
PHP
)

if php artisan tinker -n --execute "$CHECK_BACKLOG_AGE" | grep "Jobs are too old"; then
    echo "Backlog is too old"
    exit 1
fi
```
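If you'd rather have this as a proper command, here's a rough sketch of the same check as a dedicated artisan command (the class name, signature, and 3-minute default are made up; it relies on the same `RedisJobRepository` methods as the script above):

```php
<?php

namespace App\Console\Commands;

use Illuminate\Console\Command;
use Laravel\Horizon\Repositories\RedisJobRepository;

class CheckHorizonBacklogAge extends Command
{
    protected $signature = 'horizon:check-backlog-age {--max-age=180 : Maximum allowed age of a pending job, in seconds}';

    protected $description = 'Fail when any pending Horizon job has been waiting longer than the given age';

    public function handle(RedisJobRepository $repository): int
    {
        $maxAge = (int) $this->option('max-age');
        $oldestPushedAt = null;
        $fetched = 0;

        // Page through the pending jobs, tracking the oldest pushedAt timestamp.
        while ($fetched < $repository->countPending()) {
            $jobs = $repository->getPending($fetched)->all();

            if (count($jobs) === 0) {
                break;
            }

            foreach ($jobs as $job) {
                $pushedAt = (float) json_decode($job->payload)->pushedAt;
                $oldestPushedAt = $oldestPushedAt === null ? $pushedAt : min($oldestPushedAt, $pushedAt);
            }

            $fetched += count($jobs);
        }

        if ($oldestPushedAt !== null && time() - $oldestPushedAt >= $maxAge) {
            $this->error('Jobs are too old!');

            return 1; // non-zero exit code fails the liveness probe
        }

        $this->info('Jobs are ok');

        return 0;
    }
}
```

The liveness probe then just calls `php artisan horizon:check-backlog-age` instead of the tinker heredoc.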

mfn commented 2 years ago

I got triggered because you mentioned Kubernetes and want to share what I've experienced so far.

Around the end of Q3 last year we started running a secondary system of our primary app on k8s, and even without production load we saw trouble on deployment with properly shutting down workers and Horizon and guaranteeing no jobs were lost, etc. After some experimenting I was not able to get this stable, and I went with pure Laravel workers.

Their way of operating, via supervisor, seems to work a tad better with k8s; at least that's my experience so far.

incoming-th commented 1 year ago

I'm facing exactly this issue with Laravel 10 and Horizon 5.

Sometimes the job starts right away as planned, and sometimes it waits/hangs for 15 minutes before starting.

It's very random, but the 15 minutes is constant. I can't find where that could come from; the jobs sit in the Horizon dashboard waiting for something to pick them up, so I guess the workers are not seeing them (and yes, I restarted the daemon on Forge).

If anyone has a way to debug this, I've been on it for a week now.

mac-pasha commented 11 months ago

@incoming-th were you able to solve this issue?

incoming-th commented 11 months ago

> @incoming-th were you able to solve this issue?

Yes, in my case it was an infrastructure issue: we used a network load balancer to access the DB, and removing the load balancer fixed the issue.