laravel / framework

The Laravel Framework.
https://laravel.com
MIT License

[QUEUE] Job aborted for unclear reasons, retried although it shouldn't have been, but was still running? #32103

Closed · mfn closed this 4 years ago

mfn commented 4 years ago

Description:

Today I received a strange Sentry error I cannot explain; this is the stack trace:

Illuminate\Queue\MaxAttemptsExceededException: App\Jobs\SomeJob has been attempted too many times or run too long. The job may have previously timed out.
#22 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(647): Illuminate\Queue\Worker::maxAttemptsExceededException
#21 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(434): Illuminate\Queue\Worker::markJobAsFailedIfAlreadyExceedsMaxAttempts
#20 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(346): Illuminate\Queue\Worker::process
#19 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(306): Illuminate\Queue\Worker::runJob
#18 /vendor/laravel/framework/src/Illuminate/Queue/Worker.php(132): Illuminate\Queue\Worker::daemon
#17 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(112): Illuminate\Queue\Console\WorkCommand::runWorker
#16 /vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(96): Illuminate\Queue\Console\WorkCommand::handle
#15 /vendor/laravel/horizon/src/Console/WorkCommand.php(46): Laravel\Horizon\Console\WorkCommand::handle
#14 [internal](0): call_user_func_array
#13 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(32): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}
#12 /vendor/laravel/framework/src/Illuminate/Container/Util.php(36): Illuminate\Container\Util::unwrapIfClosure
#11 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(90): Illuminate\Container\BoundMethod::callBoundMethod
#10 /vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(34): Illuminate\Container\BoundMethod::call
#9 /vendor/laravel/framework/src/Illuminate/Container/Container.php(592): Illuminate\Container\Container::call
#8 /vendor/laravel/framework/src/Illuminate/Console/Command.php(134): Illuminate\Console\Command::execute
#7 /vendor/symfony/console/Command/Command.php(255): Symfony\Component\Console\Command\Command::run
#6 /vendor/laravel/framework/src/Illuminate/Console/Command.php(121): Illuminate\Console\Command::run
#5 /vendor/symfony/console/Application.php(912): Symfony\Component\Console\Application::doRunCommand
#4 /vendor/symfony/console/Application.php(264): Symfony\Component\Console\Application::doRun
#3 /vendor/symfony/console/Application.php(140): Symfony\Component\Console\Application::run
#2 /vendor/laravel/framework/src/Illuminate/Console/Application.php(93): Illuminate\Console\Application::run
#1 /vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(130): Illuminate\Foundation\Console\Kernel::handle
#0 /artisan(37): null

OK, please bear with me; this alone doesn't look strange yet.

Horizon queue configuration (for the specific queue the job ran on):

            'jobqueue' => [
                'balance' => 'simple',
                'connection' => 'redis',
                'processes' => 2,
                'queue' => ['jobqueue'],
                'sleep' => 0,
                'timeout' => 0,
                'tries' => 1,
            ],

To have as much insight as possible into our queued jobs, I'm using this in my AppServiceProvider (this also explains why I have the logs shown below):

    protected function initializeQueueEvents(): void
    {
        Queue::before(function (JobProcessing $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
        });

        Queue::after(function (JobProcessed $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $logger->info(\get_class($event), $this->getJobDetails($queueName, $job));
        });

        Queue::exceptionOccurred(function (JobExceptionOccurred $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $details = $this->getJobDetails($queueName, $job);
            $details['exception'] = $event->exception;
            $logger->error(\get_class($event), $details);

            $this->reportOtherwiseIgnoredException($event->exception);
        });

        Queue::failing(function (JobFailed $event) {
            $job = $event->job;
            $queueName = $job->getQueue();
            $logger = $this->getLoggerForQueue($queueName);
            $details = $this->getJobDetails($queueName, $job);
            $details['exception'] = $event->exception;
            $logger->error(\get_class($event), $details);
        });
    }
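
For completeness: getLoggerForQueue and getJobDetails aren't part of this report. A minimal sketch of what they might look like, inferred from the fields that show up in the log lines below (the per-queue log channel name is an assumption):

    // Sketch only; not the reporter's actual helpers.
    private function getLoggerForQueue(string $queueName): \Psr\Log\LoggerInterface
    {
        // Assumption: one log channel per queue, defined in config/logging.php
        return \Illuminate\Support\Facades\Log::channel('queue-'.$queueName);
    }

    private function getJobDetails(string $queueName, \Illuminate\Contracts\Queue\Job $job): array
    {
        // Field names mirror the JSON context of the log lines shown below;
        // $queueName is also available here if extra context is wanted.
        return [
            'name'     => $job->resolveName(),
            'id'       => $job->getJobId(),
            'attempts' => $job->attempts(),
            'maxTries' => $job->maxTries(),
            'timeout'  => $job->timeout(),
            'command'  => $job->payload()['data']['command'] ?? null,
        ];
    }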

So as you can see, the logging is very detailed.

And this is what I was able to log (unfortunately the lines are very long):

[2020-03-25T09:36:42.579873+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing           {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}
[2020-03-25T09:38:13.177079+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing           {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67077}
[2020-03-25T09:38:13.179925+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed              {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:38:13.180380+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred   {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":1,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}","exception":"[object] (Illuminate\\Queue\\MaxAttemptsExceededException(code: 0): App\Jobs\\SomeJob has been attempted too many times or run too long. The job may have previously timed out. at /project/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:647)"} {"process_id":67077}
[2020-03-25T09:40:50.637754+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed            {"name":"App\Jobs\\SomeJob","id":"537026499","attempts":0,"maxTries":null,"timeout":null,"command":"O:44:\"App\Jobs\\SomeJob\":9:{s:5:\"queue\";s:11:\"attachments\";s:15:\"\u0000*\u0000modelId\";i:1234;s:10:\"connection\";N;s:15:\"chainConnection\";N;s:10:\"chainQueue\";N;s:5:\"delay\";N;s:10:\"middleware\";a:0:{}s:7:\"chained\";a:0:{}s:6:\"\u0000*\u0000job\";N;}"} {"process_id":67079}

I tried to reduce this to the bare minimum, where I discovered something interesting (I stripped out the job ID, but they are all the same!): (screenshot of the condensed log lines omitted)

So, looking at the timestamps, it seems that almost 30s later (not exactly, but maybe close enough) the job failed ("timed out"?) and a new attempt started, which immediately failed (expected, since tries=1).

But the original was still running?

There were no deployments, code changes or system changes when this happened; the Horizon supervisor in question had already been running for some time.

Via Sentry I also have the information on how the worker that generated this error was started:

Starting Artisan command: horizon:work

'horizon:work' redis --delay=0 --memory=128 --queue=jobqueue --sleep=0 --timeout=0 --tries=1 --supervisor='server-dlsK:jobqueue'

The two issues I'm facing:

Can someone assist in digging into this?

Note regarding reproducibility: hardly reproducible, but it happens from time to time.

mfn commented 4 years ago

What's also strange:

mfn commented 4 years ago

In the meantime I was able to collect (from multiple jobs, same queue) other examples, but their timestamps don't support the "maybe 30s timeout" correlation I almost made.

One example (shortened for readability):

[2020-03-25T11:52:30.747052+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T11:54:00.327897+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T11:54:00.332469+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed
[2020-03-25T11:54:00.333052+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred
[2020-03-25T11:56:38.811077+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed

Another example:

[2020-03-25T18:04:44.818191+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T18:06:14.876755+00:00] production.INFO: Illuminate\Queue\Events\JobProcessing
[2020-03-25T18:06:14.879100+00:00] production.ERROR: Illuminate\Queue\Events\JobFailed
[2020-03-25T18:06:14.879315+00:00] production.ERROR: Illuminate\Queue\Events\JobExceptionOccurred
[2020-03-25T18:08:52.875876+00:00] production.INFO: Illuminate\Queue\Events\JobProcessed

I omitted it here, but the process_id pattern matches the initial description (i.e. the first and last entries are from the same process, and the ones in between are from another process).

mfn commented 4 years ago

In the code of one of the jobs I found that I catch an exception somewhere. This catch block does not propagate the exception further, as it's handled locally, and even in case of an exception the job is expected to return successfully.

Could a "thrown but not bubbled" exception somehow interfere here?

mfn commented 4 years ago

Man, I did a lot of local testing: setting env vars to production, disabling debug, using the same Sentry setup, simulating the job failing in the same way. Yet this only happens in production 🙈

driesvints commented 4 years ago

Hey man. Pretty swamped atm so don't have time to deep dive into this. Maybe @themsaid can help out?

mfn commented 4 years ago

Yeah sure, no pressure. I'm grateful for any tips how to further debug this 🙏

wfeller commented 4 years ago

Don't you just have a too low retry_after queue config value for your job?

Your jobs seem to be retried after 90 seconds (which is the default config value in queue.php) while they seem to run for about 4 minutes, so they're automatically being retried while the 1st instance of the job is still running.

Maybe try increasing the retry_after value in your queue.php config file?

See: https://laravel.com/docs/7.x/queues#job-expirations-and-timeouts and https://github.com/laravel/laravel/blob/master/config/queue.php
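
A minimal sketch of that change (illustrative values only, not the reporter's actual config; retry_after just needs to exceed the longest time a job on the connection can run):

        // config/queue.php
        'redis' => [
            'driver' => 'redis',
            'connection' => 'queue',
            'queue' => 'default',
            'retry_after' => 630, // assumption: no job on this connection runs longer than ~10 minutes
            'block_for' => 3,
        ],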

mfn commented 4 years ago

Oh.my.god

From config/queue.php:

        'redis' => [
            'driver' => 'redis',
            'connection' => 'queue',
            'queue' => 'default',
            'retry_after' => 90,
            'block_for' => 3,
        ],

Big 🤦‍♂️

TBH I was totally oblivious to that. I manage >20 queues via Horizon and basically haven't looked at config/queue.php in ages.

Since I have many queues with different strategies, it seems to me I should rather disable it?

I found other people with similar or puzzling issues about why a job that is still actively running would already be retried:

I also found https://github.com/laravel/framework/issues/31562#issuecomment-590035918 which explains something I wasn't aware of either:

TL;DR

(or better: TL, I still should have read) Make sure your connection's retry_after is higher than your highest queue timeout, or disable it altogether!

mfn commented 4 years ago

There's no explicit documentation, but reading the source it seems setting it to null is the way to disable it: https://github.com/laravel/framework/blob/7.x/src/Illuminate/Queue/RedisQueue.php#L193-L195

    protected function migrate($queue)
    {
        $this->migrateExpiredJobs($queue.':delayed', $queue);

        if (! is_null($this->retryAfter)) {
            $this->migrateExpiredJobs($queue.':reserved', $queue);
        }
    }

mfn commented 4 years ago

Fun fact: armed with all this knowledge, I actually can't reproduce this.

Trying to understand the flow: \Illuminate\Queue\Worker::daemon calls \Illuminate\Queue\Worker::getNextJob, which does this:

…
                if (! is_null($job = $connection->pop($queue))) {
                    return $job;
…

The ->pop() method may end up calling \Illuminate\Queue\RedisQueue::retrieveNextJob, which is the one honoring the connection's retry_after.

But most of my queues, or at least the ones where I observed this, have only one worker, which is either a) working or b) looking for a job. How could that one worker, right in the middle of processing a job, detect the retry_after? 🤔

wfeller commented 4 years ago

Not sure you can set $retryAfter to null: https://github.com/laravel/framework/blob/7.x/src/Illuminate/Queue/Connectors/RedisConnector.php#L48

    public function connect(array $config)
    {
        return new RedisQueue(
            $this->redis, $config['queue'],
            $config['connection'] ?? $this->connection,
            $config['retry_after'] ?? 60, // <- this here
            $config['block_for'] ?? null
        );
    }

What I do is use multiple queue connections with different retry_after values based on the expected timeout I will allow on the jobs. Then I also have multiple Horizon supervisors for these queue connections.

"how could that one worker right in the middle of processing a job detect the retry_after"

Weird. Maybe it's because of your processes config value in horizon.php?

I'm able to reproduce it here: https://github.com/wfeller/test-retry-after

If you run php artisan dispatch-test-job --times=10 (more than max processes in horizon.php), you'll see that some of your jobs don't get retried.

mfn commented 4 years ago

            $config['retry_after'] ?? 60, // <- this here

Ouch!

But this also means that in practice the check below from \Illuminate\Queue\RedisQueue::migrate won't do anything, unless you have a custom connector :-/

        if (! is_null($this->retryAfter)) {
            $this->migrateExpiredJobs($queue.':reserved', $queue);
        }

mfn commented 4 years ago

Oh lol, setting it to null does work in Horizon, because the code there is different; see https://github.com/laravel/horizon/blob/e01c4a3f8bf88046479163b6b74eef1df4165f76/src/Connectors/RedisConnector.php#L22

        return new RedisQueue(
            $this->redis, $config['queue'],
            Arr::get($config, 'connection', $this->connection),
            Arr::get($config, 'retry_after', 60),
            Arr::get($config, 'block_for', null)
        );

driesvints commented 4 years ago

Sorry for not understanding everything exactly, but is there anything we should fix in either Horizon or the docs? Anything we can do to make things clearer?

mfn commented 4 years ago

Still working on it 😄

mfn commented 4 years ago

@wfeller disabling it via null might actually not work though, because of \Illuminate\Queue\RedisQueue::retrieveNextJob:

        $nextJob = $this->getConnection()->eval(
            LuaScripts::pop(), 3, $queue, $queue.':reserved', $queue.':notify',
            $this->availableAt($this->retryAfter)
        );

If $this->retryAfter is null, it will be treated like 0 in availableAt, which means the job might be marked as immediately available again.
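
A rough illustration of that concern (not the framework's actual code): in PHP arithmetic, null is coerced to 0, so "now + retry_after seconds" collapses to "now" and a reserved job looks expired right away:

        $retryAfter = null;
        // (int) null === 0, so this is just time() + 0: the reservation
        // "expires" immediately and the job becomes eligible for re-delivery.
        $availableAt = time() + (int) $retryAfter;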

I will go with an insanely high value for now; that seems safer!

mfn commented 4 years ago

Basically, everything has been said, yes: the culprit was retry_after being set too low!

I increased the value to an insanely high number to effectively disable it, as I don't need this feature.
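
Illustrative only, since the exact value isn't stated; something along these lines effectively disables reserved-job re-delivery for the connection:

        // config/queue.php
        'retry_after' => 31536000, // ~1 year; assumed value, effectively "never"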

Thanks everyone, especially @wfeller for their feedback, much appreciated 🙇

wfeller commented 4 years ago

No problem @mfn 👍

What I do now is use multiple queue connections with different retry_after values, and multiple Horizon supervisors to manage the queues correctly.

In queue.php:

    'connections' => [
        'redis' => [
            'driver' => 'redis',
            'connection' => 'default',
            'queue' => 'default',
            'retry_after' => 90,
        ],

        'redis-slow' => [
            'driver' => 'redis',
            'connection' => 'default',
            'queue' => 'default-slow',
            'retry_after' => 1830, // Max $timeout of 1800 seconds in my jobs
        ],
    ],

In horizon.php:

        'production' => [

            'supervisor' => [
                'connection' => 'redis',
                'queue' => ['default', /* other normal queues */],
                'timeout' => 60,
            ],

            'supervisor-slow' => [
                'connection' => 'redis-slow',
                'queue' => ['default-slow', /* other slow queues */],
                'timeout' => 1800,
            ],
        ],

Then I'll just use $this->onQueue('default-slow') in my slow jobs' constructors (you can probably also use $this->onConnection('redis-slow') instead).
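
For illustration, a hypothetical slow job wired up that way (class name, timeout and body are made up; onQueue/onConnection come from the Queueable trait):

    namespace App\Jobs;

    use Illuminate\Bus\Queueable;
    use Illuminate\Contracts\Queue\ShouldQueue;
    use Illuminate\Foundation\Bus\Dispatchable;
    use Illuminate\Queue\InteractsWithQueue;
    use Illuminate\Queue\SerializesModels;

    class GenerateHugeReport implements ShouldQueue
    {
        use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

        // Keep this at or below the supervisor-slow timeout (1800s) and
        // below the redis-slow retry_after (1830s).
        public $timeout = 1800;

        public function __construct()
        {
            // Route to the queue served by supervisor-slow
            $this->onQueue('default-slow');
            // or: $this->onConnection('redis-slow');
        }

        public function handle(): void
        {
            // long-running work goes here
        }
    }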

J-Brk commented 3 years ago

@mfn Are you sure this issue should be closed? You might have resolved it by using an extremely large retry_after value, but the fact that Horizon supports setting it to null while the framework's queue connector does not seems like abnormal behavior to me.