laravel / framework

The Laravel Framework.
https://laravel.com
MIT License

[QUEUE] Job aborted for unclear reasons, retried although it shouldn't have been, but was still running? #32103

Closed · mfn closed this 4 years ago

mfn commented 4 years ago

Description:

Today I received a strange Sentry error I cannot explain; this is the stack trace:

Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\SomeJob has been attempted too many times or run too long. The job may have previously timed out.
#22 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(647): Illuminate\Queue\Worker::maxAttemptsExceededException
#21 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(434): Illuminate\Queue\Worker::markJobAsFailedIfAlreadyExceedsMaxAttempts
#20 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(346): Illuminate\Queue\Worker::process
#19 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(306): Illuminate\Queue\Worker::runJob
#18 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(132): Illuminate\Queue\Worker::daemon
#17 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\Queue\Console\WorkCommand::runWorker
#16 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\Queue\Console\WorkCommand::handle
#15 /vendor/laravel/horizon/src/Console/WorkCommand.php(46): Laravel\Horizon\Console\WorkCommand::handle
#14 [internal](0): call_user_func_array
#13 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}
#12 /vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\Container\Util::unwrapIfClosure
#11 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\Container\BoundMethod::callBoundMethod
#10 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\Container\BoundMethod::call
#9 /vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\Container\Container::call
#8 /vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\Console\Command::execute
#7 /vendor/symfony/console/Command/Command.php(255): Symfony\Component\Console\Command\Command::run
#6 /vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Illuminate\Console\Command::run
#5 /vendor/symfony/console/Application.php(912): Symfony\Component\Console\Application::doRunCommand
#4 /vendor/symfony/console/Application.php(264): Symfony\Component\Console\Application::doRun
#3 /vendor/symfony/console/Application.php(140): Symfony\Component\Console\Application::run
#2 /vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Illuminate\Console\Application::run
#1 /vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(130): Illuminate\Foundation\Console\Kernel::handle
#0 /artisan(37): null

OK, please bear with me; this alone doesn't look strange yet.

Horizon queue configuration (for the specific queue the job ran on):

            'jobqueue' => [
                'balance' => 'simple',
                'connection' => 'redis',
                'processes' => 2,
                'queue' => ['jobqueue'],
                'sleep' => 0,
                'timeout' => 0,
                'tries' => 1,
            ],

To have as much insight as possible into our queued jobs, I'm using this in my AppServiceProvider (this also explains why I have the logs shown below):

    protected function initializeQueueEvents(): void
    {
        Queue::before(function (JobProcessing $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
        });

        Queue::after(function (JobProcessed $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
        });

        Queue::exceptionOccurred(function (JobExceptionOccurred $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $details = $this->getJobDetails($queueName, $job);
            $details['exception'] = $event->exception;
            $logger->error(\get_class($event), $details);

            $this->reportOtherwiseIgnoredException($event->exception);
        });

        Queue::failing(function (JobFailed $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $details = $this->getJobDetails($queueName, $job);
            $details['exception'] = $event->exception;
            $logger->error(\get_class($event), $details);
        });
    }
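
For completeness: getLoggerForQueue and getJobDetails aren't part of this report. A minimal sketch of what they might look like, inferred from the fields that show up in the log lines below (the per-queue log channel name is an assumption):

    // Sketch only; not the reporter's actual helpers.
    private function getLoggerForQueue(string $queueName): \Psr\Log\LoggerInterface
    {
        // Assumption: one log channel per queue, defined in config/logging.php
        return \Illuminate\Support\Facades\Log::channel('queue-'.$queueName);
    }

    private function getJobDetails(string $queueName, \Illuminate\Contracts\Queue\Job $job): array
    {
        // Field names mirror the JSON context of the log lines shown below;
        // $queueName is also available here if extra context is wanted.
        return [
            'name'     => $job->resolveName(),
            'id'       => $job->getJobId(),
            'attempts' => $job->attempts(),
            'maxTries' => $job->maxTries(),
            'timeout'  => $job->timeout(),
            'command'  => $job->payload()['data']['command'] ?? null,
        ];
    }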

So as you can see, the logging is very detailed.

And this is what I was able to log (unfortunately the lines are very long):

[2020-03-25T09:36:42.579873+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing           {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}
[2020-03-25T09:38:13.177079+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing           {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67077}
[2020-03-25T09:38:13.179925+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed              {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:38:13.180380+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred   {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:40:50.637754+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed            {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}

I tried to reduce this to the bare minimum, where I discovered something interesting (I stripped out the job ID, but they are all the same!): (screenshot of the condensed log lines omitted)

So, looking at the timestamps, it seems that almost 30s later (not exactly, but maybe close enough) the job failed ("timed out"?) and a new attempt started, which immediately failed (expected, since tries=1).

But the original was still running?

There were no deployments, code changes or system changes when this happened; the Horizon supervisor in question had already been running for some time.

Via Sentry I also have the information on how the worker that generated this error was started:

Starting Artisan command: horizon:work

'horizon:work' redis --delay=0 --memory=128 --queue=jobqueue --sleep=0 --timeout=0 --tries=1 --supervisor='server-dlsK:jobqueue'

The two issues I'm facing:

Can someone assist in digging into this?

Note regarding reproducibility: hardly reproducible, but it happens from time to time.

mfn commented 4 years ago

What's also strange:

mfn commented 4 years ago

In the meantime I was able to collect (from multiple jobs, same queue) other examples, but their timestamps don't support the "maybe 30s timeout" correlation I almost made.

One example (shortened for readability):

[2020-03-25T11:52:30.747052+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T11:54:00.327897+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T11:54:00.332469+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed
[2020-03-25T11:54:00.333052+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred
[2020-03-25T11:56:38.811077+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed

Another example:

[2020-03-25T18:04:44.818191+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T18:06:14.876755+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T18:06:14.879100+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed
[2020-03-25T18:06:14.879315+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred
[2020-03-25T18:08:52.875876+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed

I omitted it here, but the process_id pattern matches the initial description (i.e. the first and last entries are from the same process, and the ones in between are from another process).

mfn commented 4 years ago

In the code of one of the jobs I found that I catch an exception somewhere. This catch block does not propagate the exception further, as it's handled locally, and even in case of an exception the job is expected to return successfully.

Could a "thrown but not bubbled" exception somehow interfere here?

mfn commented 4 years ago

Man, I did a lot of local testing: setting env vars to production, disabling debug, using the same Sentry setup, simulating the job failing in the same way. Yet this only happens in production 🙈

driesvints commented 4 years ago

Hey man. Pretty swamped atm so don't have time to deep dive into this. Maybe @themsaid can help out?

mfn commented 4 years ago

Yeah sure, no pressure. I'm grateful for any tips how to further debug this 🙏

wfeller commented 4 years ago

Don't you just have a too low retry_after queue config value for your job?

Your jobs seem to be retried after 90 seconds (which is the default config value in queue.php) while they seem to run for about 4 minutes, so they're automatically being retried while the 1st instance of the job is still running.

Maybe try increasing the retry_after value in your queue.php config file?

See: https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts and https://github.com/laravel/laravel/blob/master/config/queue.php
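
A minimal sketch of that change (illustrative values only, not the reporter's actual config; retry_after just needs to exceed the longest time a job on the connection can run):

        // config/queue.php
        'redis' => [
            'driver' => 'redis',
            'connection' => 'queue',
            'queue' => 'default',
            'retry_after' => 630, // assumption: no job on this connection runs longer than ~10 minutes
            'block_for' => 3,
        ],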

mfn commented 4 years ago

Oh.my.god

From config/queue.php:

        'redis' => [
            'driver' => 'redis',
            'connection' => 'queue',
            'queue' => 'default',
            'retry_after' => 90,
            'block_for' => 3,
        ],

Big 🤦‍♂️

TBH I was totally oblivious to that. I manage >20 queues via Horizon and basically haven't looked at config/queue.php in ages.

Since I have many queues with different strategies, it seems to me I should rather disable it?

I found other people with similar or puzzling issues about why a job that is still actively running would already be retried:

I also found https://github.com/laravel/framework/issues/31562#issuecomment-590035918 which explains something I wasn't aware of either:

TL;DR

(or better: TL, I still should have read) Make sure your connection's retry_after is higher than your highest queue timeout, or disable it altogether!

mfn commented 4 years ago

There's no explicit documentation, but reading the source it seems setting it to null is the way to disable it: https://github.com/laravel/framework/blob/7.x/src/Illuminate/Queue/RedisQueue.php#L193-L195

    protected function migrate($queue)
    {
        $this->migrateExpiredJobs($queue.':delayed', $queue);

        if (! is_null($this->retryAfter)) {
            $this->migrateExpiredJobs($queue.':reserved', $queue);
        }
    }

mfn commented 4 years ago

Fun fact: armed with all this knowledge, I actually can't reproduce this.

Trying to understand the flow: \Illuminate\Queue\Worker::daemon calls \Illuminate\Queue\Worker::getNextJob, which does this:

…
                if (! is_null($job = $connection->pop($queue))) {
                    return $job;
…

The ->pop() method may end up calling \Illuminate\Queue\RedisQueue::retrieveNextJob, which is the one honoring the connection's retry_after.

But most of my queues, or at least the ones where I observed this, have only one worker, which is either a) working or b) looking for a job. How could that one worker, right in the middle of processing a job, detect the retry_after? 🤔

wfeller commented 4 years ago

Not sure you can set $retryAfter to null: https://github.com/laravel/framework/blob/7.x/src/Illuminate/Queue/Connectors/RedisConnector.php#L48

    public function connect(array $config)
    {
        return new RedisQueue(
            $this->redis, $config['queue'],
            $config['connection'] ?? $this->connection,
            $config['retry_after'] ?? 60, // <- this here
            $config['block_for'] ?? null
        );
    }

What I do is use multiple queue connections with different retry_after values based on the expected timeout I will allow on the jobs. Then I also have multiple Horizon supervisors for these queue connections.

"how could that one worker right in the middle of processing a job detect the retry_after"

Weird. Maybe it's because of your processes config value in horizon.php?

I'm able to reproduce it here: https://github.com/wfeller/test-retry-after

If you run php artisan dispatch-test-job --times=10 (more than max processes in horizon.php), you'll see that some of your jobs don't get retried.

mfn commented 4 years ago

            $config['retry_after'] ?? 60, // <- this here

Ouch!

But this also means that in practice the check below from \Illuminate\Queue\RedisQueue::migrate won't do anything, unless you have a custom connector :-/

        if (! is_null($this->retryAfter)) {
            $this->migrateExpiredJobs($queue.':reserved', $queue);
        }

mfn commented 4 years ago

Oh lol, setting it to null does work in Horizon, because the code there is different; see https://github.com/laravel/horizon/blob/e01c4a3f8bf88046479163b6b74eef1df4165f76/src/Connectors/RedisConnector.php#L22

        return new RedisQueue(
            $this->redis, $config['queue'],
            Arr::get($config, 'connection', $this->connection),
            Arr::get($config, 'retry_after', 60),
            Arr::get($config, 'block_for', null)
        );

driesvints commented 4 years ago

Sorry for not understanding everything exactly, but is there anything we should fix in either Horizon or the docs? Anything we can do to make things clearer?

mfn commented 4 years ago

Still working on it 😄

mfn commented 4 years ago

@wfeller disabling it via null might actually not work though, because of \Illuminate\Queue\RedisQueue::retrieveNextJob:

        $nextJob = $this->getConnection()->eval(
            LuaScripts::pop(), 3, $queue, $queue.':reserved', $queue.':notify',
            $this->availableAt($this->retryAfter)
        );

If $this->retryAfter is null, it will be treated like 0 in availableAt, which means the job might be marked as immediately available again.
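
A rough illustration of that concern (not the framework's actual code): in PHP arithmetic, null is coerced to 0, so "now + retry_after seconds" collapses to "now" and a reserved job looks expired right away:

        $retryAfter = null;
        // (int) null === 0, so this is just time() + 0: the reservation
        // "expires" immediately and the job becomes eligible for re-delivery.
        $availableAt = time() + (int) $retryAfter;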

I will go with an insanely high value for now; that seems safer!

mfn commented 4 years ago

Basically, everything has been said, yes: the culprit was retry_after being set too low!

I increased the value to an insanely high number to effectively disable it, as I don't need this feature.
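
Illustrative only, since the exact value isn't stated; something along these lines effectively disables reserved-job re-delivery for the connection:

        // config/queue.php
        'retry_after' => 31536000, // ~1 year; assumed value, effectively "never"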

Thanks everyone, especially @wfeller for their feedback, much appreciated 🙇

wfeller commented 4 years ago

No problem @mfn 👍

What I do now is use multiple queue connections with different retry_after values, and multiple Horizon supervisors to manage the queues correctly.

In queue.php:

    'connections' => [
        'redis' => [
            'driver' => 'redis',
            'connection' => 'default',
            'queue' => 'default',
            'retry_after' => 90,
        ],

        'redis-slow' => [
            'driver' => 'redis',
            'connection' => 'default',
            'queue' => 'default-slow',
            'retry_after' => 1830, // Max $timeout of 1800 seconds in my jobs
        ],
    ],

In horizon.php:

        'production' => [

            'supervisor' => [
                'connection' => 'redis',
                'queue' => ['default', /* other normal queues */],
                'timeout' => 60,
            ],

            'supervisor-slow' => [
                'connection' => 'redis-slow',
                'queue' => ['default-slow', /* other slow queues */],
                'timeout' => 1800,
            ],
        ],

Then I'll just use $this->onQueue('default-slow') in my slow jobs' constructors (you can probably also use $this->onConnection('redis-slow') instead).
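
For illustration, a hypothetical slow job wired up that way (class name, timeout and body are made up; onQueue/onConnection come from the Queueable trait):

    namespace App\Jobs;

    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Foundation\Bus\Dispatchable;
    use Illuminate\Queue\InteractsWithQueue;
    use Illuminate\Queue\SerializesModels;

    class GenerateHugeReport implements ShouldQueue
    {
        use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

        // Keep this at or below the supervisor-slow timeout (1800s) and
        // below the redis-slow retry_after (1830s).
        public $timeout = 1800;

        public function __construct()
        {
            // Route to the queue served by supervisor-slow
            $this->onQueue('default-slow');
            // or: $this->onConnection('redis-slow');
        }

        public function handle(): void
        {
            // long-running work goes here
        }
    }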

J-Brk commented 3 years ago

@mfn Are you sure this issue should be closed? You might have resolved it by using an extremely large retry_after value, but the fact that Horizon supports setting it to null while the framework's queue connector does not seems like abnormal behavior to me.