canvural closed this issue 4 years ago.
This sounds exactly like my issue that I opened a month ago: https://github.com/laravel/framework/issues/32103
The TL;DR was an improperly configured `retry_after` that was set too low. It doesn't look like that in your case, but FWIW maybe it helps you debug this.
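For reference, a minimal sketch of the relationship being described here: the queue connection's `retry_after` must comfortably exceed the worker `timeout`, otherwise a still-running job is released and handed to a second worker. The values below are illustrative, not taken from either issue:

```php
// config/queue.php -- illustrative values only.
// A reserved job is put back on the queue after `retry_after` seconds,
// so it must be longer than any worker timeout.
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => 'default',
    'retry_after' => 120,
    'block_for' => null,
],

// config/horizon.php -- the supervisor's `timeout` kills the job process
// after 90s, safely before the 120s retry_after releases the job again.
'supervisor-1' => [
    // ...
    'timeout' => 90,
],
```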
One of the things that helped me the most was using the queue events to add very detailed logging (see initializeQueueEvents in my issue description), down to a custom Monolog processor that also logs the process ID. But I already had this in place; I didn't add it just to debug my issue.
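For anyone wanting the same visibility, here is a minimal sketch of that kind of logging, assuming Laravel's standard queue events and the Monolog 2 array-based processor signature. It mirrors the idea described above, not mfn's actual code:

```php
<?php
// app/Providers/AppServiceProvider.php -- a sketch of verbose queue logging.

namespace App\Providers;

use Illuminate\Queue\Events\JobFailed;
use Illuminate\Queue\Events\JobProcessed;
use Illuminate\Queue\Events\JobProcessing;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\Facades\Queue;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot(): void
    {
        // Fires right before a worker starts running a job.
        Queue::before(function (JobProcessing $event) {
            Log::info('queue.processing', [
                'id' => $event->job->getJobId(),
                'name' => $event->job->resolveName(),
                'attempts' => $event->job->attempts(),
            ]);
        });

        // Fires after a job finishes successfully.
        Queue::after(function (JobProcessed $event) {
            Log::info('queue.processed', ['id' => $event->job->getJobId()]);
        });

        // Fires when a job fails for good.
        Queue::failing(function (JobFailed $event) {
            Log::error('queue.failed', [
                'id' => $event->job->getJobId(),
                'exception' => $event->exception->getMessage(),
            ]);
        });

        // Tag every log record with the worker's PID so you can tell
        // which worker process handled (or stalled on) a given job.
        Log::getLogger()->pushProcessor(function (array $record) {
            $record['extra']['pid'] = getmypid();

            return $record;
        });
    }
}
```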
@mfn Yes, I saw your issue. But we always had `retry_after` greater than `timeout`.
We are also hooking into the Queue::before, Queue::after and Queue::failed events and logging some info, but there was nothing too useful in it.
Hey there,
Unfortunately we don't support this version anymore. Please check out our support policy on which versions we are currently supporting. Can you please try to upgrade to the latest version and see if your problem persists? We'll help you out and re-open this issue if so.
Thanks!
What exactly don't you support?
Laravel 6 is the LTS version and Horizon 4 only works with Laravel 7, so we can't install it right now.
Also, Horizon 4 doesn't have any critical fixes to the core logic, so I doubt it fixes the issue.
@driesvints Can you please keep the issue open? Thanks!
The first-party libraries aren't LTS, only Laravel itself is. If you can reproduce this with Horizon 4, I'll re-open.
https://github.com/laravel/horizon/compare/v3.7.2...v4.3.1
You can see in the change list that there are no critical changes: no bug fixes related to the core logic, just new features.
So ignoring the issue and just suggesting a blind upgrade doesn't really make sense in this situation.
I can't be sure of that just from looking at the list. From what I can see above, this looks like a configuration issue.
Can you first please try one of the support channels below? If you can actually identify this as a bug, feel free to report back and I'll gladly help you out and re-open this issue.
Thanks!
Upgraded to Horizon 4.3.2 and Laravel 7.9.2, problem still exists.
Most of the jobs look like this now. The first job "starts" but doesn't actually run (for example, log statements inside the job are not run); then 2 minutes later (our `retry_after` value) the second attempt is executed. This time it runs successfully, but the first job still "hangs", and after 30 minutes it is updated as in the pic above.
Did you try any of the support channels from above?
@driesvints Yes, I tried Larachat and Stack Overflow, with no luck.
Can you share the links?
https://stackoverflow.com/questions/61386640/timed-out-job-hangs-for-15-or-30-minutes-and-then-runs
I also just shared the SO link in Larachat and on Twitter.
Hi @canvural, did you find any solution for this issue? I'm getting the same issue with Laravel 7.6 and PHP 7.4 now. My jobs hang for about 2 hours before retrying, but I think this issue is related to PDO, because the job always gets stuck while executing a query. Jobs that don't use PDO never get stuck. I also opened an issue here (it was moved to a discussion): https://laracasts.com/discuss/channels/laravel/pdo-hangs-on-executing-query-on-heavy-queue-system-timeout-does-not-work
@canvural @1212087 Did either of you manage to pinpoint the source of the issue?
I am observing the same on a multi-queue, Redis Horizon setup. `retry_after` is 90s, `timeout` is 60s. The jobs work fine until one of them hangs the execution for 940-950 seconds (sometimes a bit outside of this range, but always more than 15 minutes). After that time, the job is completed, with a Runtime of 940-950 seconds. All following jobs are then executed within 0.1s. No errors in any logs.
The last time I checked, Horizon didn't log `Processing: NameSpace\ClassName` when the problem occurred. Only after the said 940-950 seconds did the `Processing:` and `Processed` messages appear, even though the job in question had its Runtime marked as over 940 seconds.
The problem is also fixed when I restart the Horizon container, but that's not the best solution. While the jobs are frozen, I can still run `php artisan queue:monitor` and see the number of jobs in the queues, so the connection to Redis is OK.
After a restart, the problem tends to disappear for a couple of hours.
@rgembalik For us this turned out to be a configuration error in our systems. We were using OpenShift and Docker. We adjusted these values in our containers/systems:
net.ipv4.tcp_keepalive_intvl
net.ipv4.tcp_keepalive_probes
net.ipv4.tcp_keepalive_time
and for now everything works normally.
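For context, these are kernel TCP keepalive settings; on a plain Linux host they live in sysctl configuration such as the snippet below. The values shown are placeholders to illustrate the shape only, not the reporter's actual numbers. Lowering them makes half-dead TCP connections (for example to Redis behind a proxy) get detected and dropped quickly instead of hanging:

```
# /etc/sysctl.d/99-tcp-keepalive.conf -- placeholder values, not the reporter's.
# Idle seconds before the first keepalive probe is sent:
net.ipv4.tcp_keepalive_time = 120
# Seconds between successive probes:
net.ipv4.tcp_keepalive_intvl = 30
# Unanswered probes before the connection is declared dead:
net.ipv4.tcp_keepalive_probes = 3
```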
@canvural That looks very promising, thanks! Do you remember the values that were there originally, and what values did you end up setting in there? I'd like to compare them with our current kubernetes nodes.
@rgembalik I don't remember the old values. But currently we have these:
For anyone who will have the same problem in the future: It turned out I cannot change these sysctl values, as, by default, they are considered "unsafe" for a kubernetes cluster. If anyone else stumbles on that problem without the ability to mess with system settings, you can try to use my workaround to detect stale jobs.
I would love it if Horizon had something similar to `artisan queue:monitor`, but instead of counting jobs, it would check whether any job has been pending for longer than X seconds. Since there isn't one, I wrote something that does that check for me.
Caution! The solution you are about to see is hacky and dirty. For best results, implement it as a separate artisan command in your code. That wasn't a viable option for me at the moment, so I used `artisan tinker --execute`.
What the script does is go through all of the jobs in Horizon, find the oldest one, and check whether it is older than 3 * 60 seconds (3 minutes). If it is, the script fails. You can use it as a health check, which can, for example, trigger a restart of the Horizon container.
```bash
#!/bin/sh

# Collect the pushedAt timestamp of every pending job known to Horizon,
# then fail if the oldest one has been waiting for more than 3 minutes.
CHECK_BACKLOG_AGE=$(cat <<-"PHP"
$repo = App::make('Laravel\Horizon\Repositories\RedisJobRepository');
$jobStartTimes = [];
while (count($jobStartTimes) < $repo->countPending()) {
    // getPending() pages through the pending jobs from the given offset.
    $jobs = $repo->getPending(count($jobStartTimes))->all();
    if (count($jobs) == 0) {
        break;
    }
    $jobStartTimes = array_merge($jobStartTimes, array_map(function ($job) {
        $date = new \DateTime();
        $date->setTimestamp(json_decode($job->payload)->pushedAt);
        return $date;
    }, $jobs));
}
if (!$jobStartTimes || count($jobStartTimes) == 0) {
    echo "Jobs are ok".PHP_EOL;
    return;
}
// Sort descending so the oldest timestamp ends up last, then pop it off.
arsort($jobStartTimes);
$diff = (new \DateTime())->getTimestamp() - array_pop($jobStartTimes)->getTimestamp();
if ($diff >= 3 * 60) {
    echo "Jobs are too old!".PHP_EOL;
    return;
}
echo "Jobs are ok".PHP_EOL;
PHP
)

if php artisan tinker -n --execute "$CHECK_BACKLOG_AGE" | grep "Jobs are too old"; then
    echo "Backlog is too old"
    exit 1
fi
```
I got triggered because you mentioned kubernetes and want to share what I've experienced so far.
Around the end of Q3 last year we started running a secondary system of our primary app on k8s, and even without production load we saw troubles with deployments, with properly shutting down workers and Horizon, and with guaranteeing no jobs were lost. After some experimenting I was not able to make this stable, so I went with pure Laravel workers.
Their way of operating, via supervisor, seems to work a tad better with k8s; at least that's my experience so far.
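For anyone weighing that option: the plain-worker setup mentioned here is typically a supervisor program along the lines of the example in the Laravel docs. The paths, process count, and flags below are illustrative:

```ini
; /etc/supervisor/conf.d/laravel-worker.conf -- illustrative values.
[program:laravel-worker]
process_name=%(program_name)s_%(process_num)02d
command=php /var/www/html/artisan queue:work redis --sleep=3 --tries=3
autostart=true
autorestart=true
; Signal the whole process group on stop so workers shut down together.
stopasgroup=true
killasgroup=true
numprocs=4
redirect_stderr=true
stdout_logfile=/var/www/html/storage/logs/worker.log
; Should exceed your longest-running job so a stopping worker can drain.
stopwaitsecs=3600
```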
Facing exactly this issue with Laravel 10 and Horizon 5.
Sometimes the job starts right away as planned; sometimes it hangs for 15 minutes before starting.
It's very random, but the 15 minutes is constant, and I can't find where it could come from. The jobs sit in the Horizon dashboard waiting for something to pick them up, so I guess the workers are not seeing them (and yes, I restarted the daemon on Forge).
If anyone has a way to debug this, I've been on it for a week now.
@incoming-th did you manage to solve this issue?
Yes, in my case it was an infrastructure issue: we used a network load balancer to access the DB. Removing the load balancer fixed the issue.
Description:
We are having strange errors with our Horizon. Basically this is what happens:
1. A job starts processing, then hangs.
2. After 2 minutes (our `retry_after` value) the job is retried. This attempt runs successfully.
3. 15 or 30 minutes later, the first, hung attempt runs as well.

Seems like this can happen to any kind of job. For example, if it's a mailable that is queued, the user gets an email first, then 15 or 30 minutes later the user gets another email. The same one.
Here are our config files.
`config/database.php`:
```php
'redis' => [
    'client' => env('REDIS_CLIENT', 'predis'),

    'default' => [
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD', null),
        'port' => env('REDIS_PORT', 6379),
        'database' => 0,
    ],
],
```

`config/queue.php`:
```php
'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => env('DEFAULT_QUEUE_NAME', 'default'),
    'retry_after' => 120, // 2 minutes
    'block_for' => null,
],
```

`config/horizon.php`:
```php
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => env('HORIZON_CONNECTION', 'database'),
            'queue' => [env('DEFAULT_QUEUE_NAME', 'default')],
            'balance' => 'simple',
            'processes' => 10,
            'tries' => 3,
            'timeout' => 90,
        ],
    ],
]
```

Here is how it looks in the Horizon Dashboard.
This is when the initial job times out. It stays like this in `Recent Jobs` while the retries are working.

![horizon1](https://user-images.githubusercontent.com/1574232/80081135-d379aa80-8552-11ea-8d10-aea79c16a186.png)

After almost half an hour it changes to this:

![horizon2](https://user-images.githubusercontent.com/1574232/80081160-dc6a7c00-8552-11ea-8c4d-aa3bc3cc91e1.png)

It's the same tags, I just blacked out the names.
Here are the logs we are seeing (times are in UTC):

```
[2020-04-22 11:24:59][88] Processing: App\Mail\ReservationInformation
[2020-04-22 11:29:00][88] Failed: App\Mail\ReservationInformation
[2020-04-22 11:29:00][88] Processing: App\Mail\ReservationInformation
[2020-04-22 11:56:21][88] Processed: App\Mail\ReservationInformation
```

**Note**: With Predis we also see some logs like `Error while reading line from the server. [tcp://REDIS_HOST:6379]`, but with PHPRedis there were none.

Here is an example JSON returned from the Horizon endpoint `/horizon/api/jobs/failed/332`:
```JSON { "JSON":{ "id":"332", "connection":"redis", "queue":"newquebe1p", "name":"App\\Jobs\\SomeJob", "status":"failed", "payload":{ "type":"job", "timeout":null, "tags":[ "Some tags" ], "id":"332", "data":{ "command":"....s:5:\"tries\";i:1;s:6:\"\u0000*\u0000job\";N;s:10:\"connection\";N;s:5:\"queue\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}}", "commandName":"App\\Jobs\\SomeJob" }, "displayName":"App\\Jobs\\SomeJob", "timeoutAt":null, "pushedAt":"1587561267.9688", "job":"Illuminate\\Queue\\CallQueuedHandler@call", "maxTries":1, "delay":null, "attempts":2 }, "exception":"Illuminate\\Queue\\MaxAttemptsExceededException: App\\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. in /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:632\nStack trace:\n#0 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(446): Illuminate\\Queue\\Worker->maxAttemptsExceededException(Object(Illuminate\\Queue\\Jobs\\RedisJob))\n#1 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(358): Illuminate\\Queue\\Worker->markJobAsFailedIfAlreadyExceedsMaxAttempts('redis', Object(Illuminate\\Queue\\Jobs\\RedisJob), 1)\n#2 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(314): Illuminate\\Queue\\Worker->process('redis', Object(Illuminate\\Queue\\Jobs\\RedisJob), Object(Illuminate\\Queue\\WorkerOptions))\n#3 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(134): Illuminate\\Queue\\Worker->runJob(Object(Illuminate\\Queue\\Jobs\\RedisJob), 'redis', Object(Illuminate\\Queue\\WorkerOptions))\n#4 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\\Queue\\Worker->daemon('redis', 'floquebe1p', Object(Illuminate\\Queue\\WorkerOptions))\n#5 /var/www/html/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\\Queue\\Console\\WorkCommand->runWorker('redis', 'floquebe1p')\n#6 /var/www/html/vendor/laravel/horizon/src/Console/WorkCommand.php(46): Illuminate\\Queue\\Console\\WorkCommand->handle()\n#7 [internal function]: Laravel\\Horizon\\Console\\WorkCommand->handle()\n#8 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): call_user_func_array(Array, Array)\n#9 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\\Container\\BoundMethod::Illuminate\\Container\\{closure}()\n#10 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\\Container\\Util::unwrapIfClosure(Object(Closure))\n#11 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\\Container\\BoundMethod::callBoundMethod(Object(Illuminate\\Foundation\\Application), Array, Object(Closure))\n#12 /var/www/html/vendor/laravel/framework/src/Illuminate/Container/Container.php(590): Illuminate\\Container\\BoundMethod::call(Object(Illuminate\\Foundation\\Application), Array, Array, NULL)\n#13 /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php(173): Illuminate\\Container\\Container->call(Array)\n#14 /var/www/html/vendor/symfony/console/Command/Command.php(255): Illuminate\\Console\\Command->execute(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))\n#15 /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Command.php(160): 
Symfony\\Component\\Console\\Command\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Illuminate\\Console\\OutputStyle))\n#16 /var/www/html/vendor/symfony/console/Application.php(1011): Illuminate\\Console\\Command->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#17 /var/www/html/vendor/symfony/console/Application.php(272): Symfony\\Component\\Console\\Application->doRunCommand(Object(Laravel\\Horizon\\Console\\WorkCommand), Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#18 /var/www/html/vendor/symfony/console/Application.php(148): Symfony\\Component\\Console\\Application->doRun(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#19 /var/www/html/vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Symfony\\Component\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#20 /var/www/html/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(131): Illuminate\\Console\\Application->run(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#21 /var/www/html/artisan(37): Illuminate\\Foundation\\Console\\Kernel->handle(Object(Symfony\\Component\\Console\\Input\\ArgvInput), Object(Symfony\\Component\\Console\\Output\\ConsoleOutput))\n#22 {main}", "failed_at":"1587562328.2697", "completed_at":"1587562207.4457", "retried_by":[ ], "reserved_at":"1587561389.885", "index":0 }, } ``` This is not the same job as in the screenshots. But I thought it might be still useful. Specially timestamps.We tried a lot of different combinations, to eliminate the problem. And it happened in every combination. So we think it must be something with Horizon.
We tried:
- Different `read_write_timeout` values (see the sketch below for where that option lives)

There is only one instance of Horizon running, and no other queue is handled by this Horizon instance.
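For completeness, a sketch of where that option goes: `read_write_timeout` is a Predis connection parameter set alongside the other Redis connection settings. The value below is an example only, not one of the combinations actually tried:

```php
// config/database.php -- illustrative placement of the Predis
// read_write_timeout connection parameter.
'redis' => [
    'client' => env('REDIS_CLIENT', 'predis'),

    'default' => [
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD', null),
        'port' => env('REDIS_PORT', 6379),
        'database' => 0,
        // Values <= 0 are commonly used to disable the client-side
        // read/write timeout on the socket in Predis.
        'read_write_timeout' => 0,
    ],
],
```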
Steps To Reproduce:
This is the hard part: it happens randomly, and we can't reproduce it locally.
Any information or tips to try are welcome!