Closed mfn closed 4 years ago
What's also strange:
In the meantime I was able to collect (from multiple jobs, same queue) other examples, but their timestamps don't fit the "maybe a 30s timeout" correlation I almost made.
One example (shortened for readability):
[2020-03-25T11:52:30.747052+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T11:54:00.327897+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T11:54:00.332469+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed
[2020-03-25T11:54:00.333052+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred
[2020-03-25T11:56:38.811077+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed
Another example:
[2020-03-25T18:04:44.818191+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T18:06:14.876755+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T18:06:14.879100+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed
[2020-03-25T18:06:14.879315+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred
[2020-03-25T18:08:52.875876+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed
I did omit it, but the process_id patterns match the initial description (i.e. the first and last entries come from the same process, the ones in between come from another process).
In the code of one of the jobs I found that I catch an Exception
somewhere. This catch block does not propagate the exception further as it's handled locally and, even in case of exception, this job is expected to successfully return.
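Schematically it looks something like this (a hypothetical handle() method, not the actual job code; doTheActualWork() is a made-up placeholder):

public function handle()
{
    try {
        $this->doTheActualWork();
    } catch (\Exception $e) {
        // Handled locally and not re-thrown: the job is still expected
        // to finish "successfully" from the queue's point of view.
        report($e);
    }
}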
Could a "thrown but not bubbled" exception somehow interfere here?
Man, I did a lot of local testing: setting env vars to production, disabling debug, using the same Sentry setup, simulating the job failing in the same way: yet this only happens in production 🙈
Hey man. Pretty swamped atm so don't have time to deep dive into this. Maybe @themsaid can help out?
Yeah sure, no pressure. I'm grateful for any tips how to further debug this 🙏
Isn't your retry_after queue config value simply too low for your job?
Your jobs seem to be retried after 90 seconds (which is the default config value in queue.php) while they seem to run for about 4 minutes, so they're automatically being retried while the 1st instance of the job is still running.
Maybe try increasing the retry_after value in your queue.php config file?
See: https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts https://github.com/laravel/laravel/blob/master/config/queue.php
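For illustration, roughly this kind of change in config/queue.php (the numbers here are placeholders, not your actual config; pick something comfortably above your longest job runtime):

'redis' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => 'default',
    // Must be larger than the longest-running job on this connection,
    // otherwise the job is re-delivered while it is still running.
    'retry_after' => 600,
    'block_for' => null,
],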
Oh.my.god
From config/queue.php:
'redis' => [
    'driver' => 'redis',
    'connection' => 'queue',
    'queue' => 'default',
    'retry_after' => 90,
    'block_for' => 3,
],
Big 🤦♂️
TBH I was totally oblivious to that. I manage >20 queues via Horizon and basically haven't looked at config/queue.php in ages.
Having many queues with different strategies, it seems to me I should rather disable it?
Found people with similar or puzzling issues, wondering why a job which is still actively running would already be retried:
- retry_after is defined on the connection, while the timeout is per queue
- there is no per-queue retry_after, only a global one (on the respective queue connection)

I also found https://github.com/laravel/framework/issues/31562#issuecomment-590035918 which explains something I also wasn't aware of:
- retry_after, the config setting, applies to the whole queue (and is stored on \Illuminate\Queue\RedisQueue::$retryAfter)
- $retryAfter on the Job (probably means \Illuminate\Events\CallQueuedListener::$retryAfter) is about delaying the retry of a failed job

TL;DR (or rather: the TL;DR I still should have read):
Make sure your connection retry_after is higher than your highest queue timeout; or disable it altogether!
There's no explicit information, but reading the source it seems setting it to null is the way to disable it: https://github.com/laravel/framework/blob/7.x/src/Illuminate/Queue/RedisQueue.php#L193-L195
protected function migrate($queue)
{
    $this->migrateExpiredJobs($queue.':delayed', $queue);

    if (! is_null($this->retryAfter)) {
        $this->migrateExpiredJobs($queue.':reserved', $queue);
    }
}
Fun fact, armed with all this knowledge, I actually can't reproduce this locally:
- set retry_after to 5s
- the job runs longer than retry_after
🤷‍♀️

When I try to understand the flow: \Illuminate\Queue\Worker::daemon calls \Illuminate\Queue\Worker::getNextJob, which does this:
…
if (! is_null($job = $connection->pop($queue))) {
return $job;
…
The ->pop() method might call \Illuminate\Queue\RedisQueue::retrieveNextJob, which is the one honoring the connection retry_after.
But most of my queues, or at least the ones where I observed this, have only one worker, which is either a) working or b) looking for a job. How could that one worker, right in the middle of processing a job, detect the retry_after? 🤔
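For reference, a rough sketch of the mechanics as I understand them (heavily simplified and only illustrative: it assumes a Predis client and Laravel's queues:* key layout, while the real implementation does all of this atomically via the Lua scripts in \Illuminate\Queue\LuaScripts):

use Predis\Client;

// Every pop first migrates expired reserved jobs back onto the queue, then
// reserves the next job until now + retry_after. So it is not the busy worker
// that "detects" retry_after: any process popping from the same queue will
// re-queue a still-running job once its reservation has expired.
function popSketch(Client $redis, string $queue, int $retryAfter): ?string
{
    $now = time();

    // migrate(): move reserved jobs whose reservation has expired back to the queue
    foreach ($redis->zrangebyscore("queues:$queue:reserved", '-inf', $now) as $payload) {
        $redis->zrem("queues:$queue:reserved", $payload);
        $redis->rpush("queues:$queue", [$payload]);
    }

    // retrieveNextJob(): pop a job and reserve it for retry_after seconds
    $payload = $redis->lpop("queues:$queue");

    if ($payload !== null) {
        $redis->zadd("queues:$queue:reserved", [$payload => $now + $retryAfter]);
    }

    return $payload;
}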
Not sure you can set $retryAfter to null: https://github.com/laravel/framework/blob/7.x/src/Illuminate/Queue/Connectors/RedisConnector.php#L48
public function connect(array $config)
{
    return new RedisQueue(
        $this->redis, $config['queue'],
        $config['connection'] ?? $this->connection,
        $config['retry_after'] ?? 60, // <- this here
        $config['block_for'] ?? null
    );
}
What I do is I use multiple queue connections with different retry_after values based on the expected timeout I will allow on the jobs. Then I also have multiple horizon supervisors for these queue connections.
how could that one worker right in the middle of processing a job detect the retry_after
Weird. Maybe it's because of your processes config value in horizon.php?
I'm able to reproduce it here: https://github.com/wfeller/test-retry-after
php artisan horizon
php artisan dispatch-test-job
If you run php artisan dispatch-test-job --times=10
(more than max processes in horizon.php), you'll see that some of your jobs don't get retried.
$config['retry_after'] ?? 60, // <- this here
Ouch!
But this also means that in practice the check below from \Illuminate\Queue\RedisQueue::migrate will never do anything, unless you have a custom connector :-/
if (! is_null($this->retryAfter)) {
    $this->migrateExpiredJobs($queue.':reserved', $queue);
}
Oh lol, it does work setting it to null in Horizon, because there the code is different, see https://github.com/laravel/horizon/blob/e01c4a3f8bf88046479163b6b74eef1df4165f76/src/Connectors/RedisConnector.php#L22
return new RedisQueue(
    $this->redis, $config['queue'],
    Arr::get($config, 'connection', $this->connection),
    Arr::get($config, 'retry_after', 60),
    Arr::get($config, 'block_for', null)
);
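The difference matters exactly when retry_after is explicitly set to null; a quick illustration (standalone snippet, not framework code):

use Illuminate\Support\Arr;

$config = ['retry_after' => null];

// Plain Laravel's RedisConnector: null coalescing replaces the explicit null
$a = $config['retry_after'] ?? 60;          // 60

// Horizon's RedisConnector: the key exists, so the default is ignored
$b = Arr::get($config, 'retry_after', 60);  // null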
Sorry for not understanding everything exactly but is there anything we should fix in either Horizon or the docs? Anything we can do to make things more clear?
Still working on it 😄
@wfeller disabling via null might actually not work though, because of \Illuminate\Queue\RedisQueue::retrieveNextJob:
$nextJob = $this->getConnection()->eval(
    LuaScripts::pop(), 3, $queue, $queue.':reserved', $queue.':notify',
    $this->availableAt($this->retryAfter)
);
If $this->retryAfter is null, it will be treated like 0 in availableAt, which means the job might be marked as being immediately available.
I will go for an insanely high value for now, seems safer!
Basically, everything has been said, yes: the culprit was the retry_after being set too low!
Increased the value to an insane number to effectively disable it as I don't need this feature.
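For the record, the connection now looks roughly like this (the exact number is arbitrary, anything far above the longest job runtime does the trick):

'redis' => [
    'driver' => 'redis',
    'connection' => 'queue',
    'queue' => 'default',
    // effectively disables re-delivery of reserved jobs for my use case
    'retry_after' => 31536000, // one year
    'block_for' => 3,
],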
Thanks everyone, especially @wfeller for their feedback, much appreciated 🙇
No problem @mfn 👍
What I do now is use multiple queue connections with different retry_after values, and multiple Horizon supervisors to manage the queues correctly.
In queue.php:
'connections' => [
    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => 'default',
        'retry_after' => 90,
    ],
    'redis-slow' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => 'default-slow',
        'retry_after' => 1830, // Max $timeout of 1800 seconds in my jobs
    ],
],
In horizon.php:
'production' => [
    'supervisor' => [
        'connection' => 'redis',
        'queue' => ['default', /* other normal queues */],
        'timeout' => 60,
    ],
    'supervisor-slow' => [
        'connection' => 'redis-slow',
        'queue' => ['default-slow', /* other slow queues */],
        'timeout' => 1800,
    ],
],
Then I'll just use $this->onQueue('default-slow') in my slow jobs' constructors (you can probably also use $this->onConnection('redis-slow') instead).
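For example, a hypothetical slow job wired up this way (class name and body are made up; the timeout mirrors the supervisor-slow config above):

<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class GenerateLargeReport implements ShouldQueue
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    // Stays within both the supervisor-slow timeout (1800) and the
    // redis-slow connection's retry_after (1830).
    public $timeout = 1800;

    public function __construct()
    {
        // Route the job onto the slow queue handled by supervisor-slow
        $this->onQueue('default-slow');

        // or: $this->onConnection('redis-slow');
    }

    public function handle()
    {
        // long-running work goes here
    }
}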
@mfn Are you sure this issue should be closed? You might have resolved it by using an extremely large retry_after, but the fact that Horizon supports setting it to null while the queue component does not seems like abnormal behavior to me.
Description:
Today I received a strange Sentry error I cannot explain; this is the stacktrace:
Ok, please bear with me, this alone doesn't look strange.
Horizon queue configuration (that specific queue the job ran on):
To have as much insight as possible into our queued jobs, I'm logging all queue events from my AppServiceProvider (this will later explain why I have the logs I do); as you'll see, there will be very detailed logging.
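A simplified sketch of what that wiring might look like in the AppServiceProvider (the real code logs more context, e.g. the process_id referenced elsewhere in the thread; the listener bodies here are assumptions):

namespace App\Providers;

use Illuminate\Queue\Events\JobExceptionOccurred;
use Illuminate\Queue\Events\JobFailed;
use Illuminate\Queue\Events\JobProcessed;
use Illuminate\Queue\Events\JobProcessing;
use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;
use Illuminate\Support\ServiceProvider;

class AppServiceProvider extends ServiceProvider
{
    public function boot()
    {
        // Log every queue lifecycle event, using the event class name as the
        // message: this is exactly what the log excerpts in this thread show.
        foreach ([JobProcessing::class, JobProcessed::class] as $event) {
            Event::listen($event, function ($e) use ($event) {
                Log::info($event, ['job' => $e->job->getRawBody()]);
            });
        }

        foreach ([JobFailed::class, JobExceptionOccurred::class] as $event) {
            Event::listen($event, function ($e) use ($event) {
                Log::error($event, ['job' => $e->job->getRawBody()]);
            });
        }
    }
}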
And this is what I was able to log (unfortunately very very long lines):
I tried to reduce this to the bare minimum, where I discovered something interesting (I stripped out the job ID, but they are all the same!). The queue runs with tries=1.
So it looks like (going by the timestamps) that almost 30s later (not exactly, but maybe close enough) the job failed ("timed out"?) and a new one started, which immediately failed (expected, tries=1).
But the original was still running?
There were no deployments, code changes or system changes when this happened, i.e. the very horizon supervisor was running for some time already.
Via Sentry I also have the information on how the worker which generated this error was started:
The two issues I'm facing:
Can someone assist in digging into this?
Note regarding reproducibility: hardly, but it happens from time to time