laravel / framework

The Laravel Framework.
https://laravel.com
MIT License

Error removing jobs from the reserved queue #47356

Closed andrej-schaefer closed 1 year ago

andrej-schaefer commented 1 year ago

Laravel Version

10.13.1

PHP Version

8.2.6

Database Driver & Version

No response

Description

We encountered problems with job uniqueness. Jobs configured with a maximum of 1 attempt failed multiple times and could not be written to the failed_jobs table because of the failed_jobs_uuid_unique constraint. The constraint violation itself may be a separate bug affecting jobs that implement "ShouldBeUnique", since error handling breaks when such a job fails more than once. Following the stack trace, we found that the underlying failure was a MaxAttemptsExceededException.

[2023-06-05 13:59:10] production.ERROR: SQLSTATE[23000]: Integrity constraint violation: 1062 Duplicate entry 'b0985078-42cd-4512-a70f-91fad1f69ca5' for key 'failed_jobs.failed_jobs_uuid_unique' (Connection: mysql, SQL: insert into `failed_jobs` (`uuid`, `connection`, `queue`, `payload`, `exception`, `failed_at`) values (b0985078-42cd-4512-a70f-91fad1f69ca5, redis, default, {"backoff":null,"uuid":"b0985078-42cd-4512-a70f-91fad1f69ca5","timeout":290,"tags":{},"id":"b0985078-42cd-4512-a70f-91fad1f69ca5","job":"Illuminate\\Queue\\CallQueuedHandler@call","pushedAt":"1685973367.0205","maxExceptions":null,"maxTries":1,"data":{"command":"O:47:\"App\\Domain\\Caching\\Jobs\\FullyCacheIdentifierJob\":1:{s:56:\"\u0000App\\Domain\\Caching\\Jobs\\FullyCacheIdentifierJob\u0000refUuid\";s:36:\"63f3cb9c-66f4-47af-a915-882d37e999d6\";}","commandName":"App\\Domain\\Caching\\Jobs\\FullyCacheIdentifierJob"},"failOnTimeout":true,"silenced":false,"retryUntil":null,"type":"job","displayName":"App\\Domain\\Caching\\Jobs\\FullyCacheIdentifierJob","attempts":2}, Illuminate\Queue\MaxAttemptsExceededException: App\Domain\Caching\Jobs\FullyCacheIdentifierJob has been attempted too many times or run too long. The job may have previously timed out. in /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Worker.php:780
Stack trace:
#0 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(514): Illuminate\Queue\Worker->maxAttemptsExceededException(Object(Illuminate\Queue\Jobs\RedisJob))
#1 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(423): Illuminate\Queue\Worker->markJobAsFailedIfAlreadyExceedsMaxAttempts('redis', Object(Illuminate\Queue\Jobs\RedisJob), 1)
#2 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(384): Illuminate\Queue\Worker->process('redis', Object(Illuminate\Queue\Jobs\RedisJob), Object(Illuminate\Queue\WorkerOptions))
#3 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Worker.php(175): Illuminate\Queue\Worker->runJob(Object(Illuminate\Queue\Jobs\RedisJob), 'redis', Object(Illuminate\Queue\WorkerOptions))
#4 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(136): Illuminate\Queue\Worker->daemon('redis', 'default', Object(Illuminate\Queue\WorkerOptions))
#5 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Queue/Console/WorkCommand.php(119): Illuminate\Queue\Console\WorkCommand->runWorker('redis', 'default')
#6 /home/www-data/api/vendor/laravel/horizon/src/Console/WorkCommand.php(51): Illuminate\Queue\Console\WorkCommand->handle()
#7 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(36): Laravel\Horizon\Console\WorkCommand->handle()
#8 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Container/Util.php(41): Illuminate\Container\BoundMethod::Illuminate\Container\{closure}()
#9 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(93): Illuminate\Container\Util::unwrapIfClosure(Object(Closure))
#10 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Container/BoundMethod.php(35): Illuminate\Container\BoundMethod::callBoundMethod(Object(Illuminate\Foundation\Application), Array, Object(Closure))
#11 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Container/Container.php(661): Illuminate\Container\BoundMethod::call(Object(Illuminate\Foundation\Application), Array, Array, NULL)
#12 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Console/Command.php(194): Illuminate\Container\Container->call(Array)
#13 /home/www-data/api/vendor/symfony/console/Command/Command.php(312): Illuminate\Console\Command->execute(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#14 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Console/Command.php(163): Symfony\Component\Console\Command\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Illuminate\Console\OutputStyle))
#15 /home/www-data/api/vendor/symfony/console/Application.php(1040): Illuminate\Console\Command->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#16 /home/www-data/api/vendor/symfony/console/Application.php(314): Symfony\Component\Console\Application->doRunCommand(Object(Laravel\Horizon\Console\WorkCommand), Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#17 /home/www-data/api/vendor/symfony/console/Application.php(168): Symfony\Component\Console\Application->doRun(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#18 /home/www-data/api/vendor/laravel/framework/src/Illuminate/Foundation/Console/Kernel.php(200): Symfony\Component\Console\Application->run(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#19 /home/www-data/api/artisan(35): Illuminate\Foundation\Console\Kernel->handle(Object(Symfony\Component\Console\Input\ArgvInput), Object(Symfony\Component\Console\Output\ConsoleOutput))
#20 {main}, 2023-06-05 14:00:40)) {"exception":{"errorInfo":["23000",1062,"Duplicate entry '3ca6725e-ed61-45b1-8df2-109631e964bb' for key 'failed_jobs.failed_jobs_uuid_unique'"],"connectionName":"mysql"}}

After removing the unique constraint from the database (for easier debugging), we narrowed down the error to the following log line:

[2023-06-05 13:43:54] production.ERROR: App\Domain\Caching\Jobs\BuildCacheJob has been attempted too many times or run too long. The job may have previously timed out. {"exception":[]}

After diving deeper into debugging, we discovered that jobs were not being removed from the "reserved" queue correctly, so the workers executed them again, resulting in the MaxAttemptsExceededException. Further examination of the source code revealed that the removal in \Illuminate\Queue\RedisQueue::deleteReserved was failing: the call always returned 0, indicating that the entry was never removed.
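
For context, that method is essentially a single ZREM against the reserved sorted set. Paraphrased from Illuminate\Queue\RedisQueue in 10.x (the exact code may differ between versions):

// Paraphrased from Illuminate\Queue\RedisQueue. The member passed to
// zrem is the raw JSON payload of the reserved job, as stored by the
// Lua script that reserved it.
public function deleteReserved($queue, $job)
{
    $this->getConnection()->zrem(
        $this->getQueue($queue).':reserved',
        $job->getReservedJob()
    );
}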

After spending more hours experimenting with different Redis clients and configurations, we found that enabling compression and/or serialization in the Redis database configuration resulted in a faulty zrem request to Redis. As a workaround, we disabled the two options for the default database, which is used by the worker, and did not investigate the problem further.

Our hypothesis is that the member stored in the reserved set is written without the client's serialization or compression applied, while the member sent in the zrem request has them applied, so the two never match and the entry is not deleted. The workaround was to disable the serializer and compression for the worker's connection:

'redis' => [

    'client' => env('REDIS_CLIENT', 'phpredis'),

    'options' => [
        'cluster' => env('REDIS_CLUSTER', 'redis'),
        'prefix' => 'XXX:',
        'serializer' => 2, // Redis::SERIALIZER_IGBINARY,
        'compression' => 2 // Redis::COMPRESSION_ZSTD,
    ],

    'default' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_DB', '0'),
        'scheme' => env('REDIS_SCHEME', 'tcp'),
        'options' => [
            'serializer' => 0, // global option disabled because it breaks removing jobs from the reserved queue
            'compression' => 0, // global option disabled because it breaks removing jobs from the reserved queue
        ],
    ],

    'cache' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_CACHE_DB', '1'),
        'scheme' => env('REDIS_SCHEME', 'tcp'),
    ],
],

Hope I haven't forgotten anything :) Thanks!

Steps To Reproduce

Configure the Redis database with phpredis as the client, the serializer option set to 2 (Redis::SERIALIZER_IGBINARY), and the compression option set to 2 (Redis::COMPRESSION_ZSTD):

'redis' => [

    'client' => env('REDIS_CLIENT', 'phpredis'),

    'options' => [
        'cluster' => env('REDIS_CLUSTER', 'redis'),
        'prefix' => 'XXX:',
        'serializer' => 2, // Redis::SERIALIZER_IGBINARY,
        'compression' => 2 // Redis::COMPRESSION_ZSTD,
    ],

    'default' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_DB', '0'),
        'scheme' => env('REDIS_SCHEME', 'tcp'),
    ],

    'cache' => [
        'url' => env('REDIS_URL'),
        'host' => env('REDIS_HOST', '127.0.0.1'),
        'password' => env('REDIS_PASSWORD'),
        'port' => env('REDIS_PORT', '6379'),
        'database' => env('REDIS_CACHE_DB', '1'),
        'scheme' => env('REDIS_SCHEME', 'tcp'),
    ],
],

Create a simple job with tries set to 1 (I have also added ShouldBeUnique to provoke the DB constraint error, which from my point of view is a separate bug):

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldBeUnique;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Foundation\Bus\Dispatchable;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Queue\SerializesModels;

class SomeJob implements ShouldQueue, ShouldBeUnique
{
    use Dispatchable, InteractsWithQueue, Queueable, SerializesModels;

    public $tries = 1;

    public function handle(): void
    {
        sleep(1);
    }

    public function uniqueId(): string
    {
        return '7411'; // return a string to match the declared return type
    }
}

Dispatch the job and wait for the second execution. As described above, this results in multiple executions on the second worker run, because the job was not removed from the reserved queue. A minimal dispatch sketch follows.
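
To drive the reproduction (a hypothetical sketch; assumes the config and SomeJob above):

// Dispatch once. With the global serializer/compression options enabled,
// the reserved entry is never removed, so a later worker pass re-reserves
// the same job and throws MaxAttemptsExceededException ($tries is 1).
SomeJob::dispatch();

// Then run a worker, for example:
//   php artisan queue:work redis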

driesvints commented 1 year ago

Heya, thanks for reporting.

We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here? Please make sure that you have the latest version of the Laravel installer in order to run this command. Please also make sure you have both Git & the GitHub CLI tool properly set up.

laravel new bug-report --github="--public"

Please do not amend and create a separate commit with your custom changes. After you've posted the repository, we'll try to reproduce the issue.

Thanks!

tillkruss commented 1 year ago

Is this reproducible when the serializer and compression options are disabled?

driesvints commented 1 year ago

@tillkruss seems like that works:

[Screenshot 2023-06-06 at 15:22:16]
tillkruss commented 1 year ago

My first thought is Lua scripts. They are used by the Queue component, but will disregard any serializer or compression options that PhpRedis may use. So PhpRedis may set queue:job-123 to {foo: "bar"} and then the Lua script would consider it gibberish and fail.

For this scenario the only solution is: Don't use data mutations for queue Redis connections.

Alternatively, refactor everything to not use Lua (not a bad idea) and use atomic MULTI transactions.

It may be good to actually unset the two Redis::OPT_* options on the connection in the queue module.
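
To illustrate the mismatch, a hypothetical self-contained sketch (not the framework's code; assumes phpredis built with igbinary and zstd support):

$redis = new Redis();
$redis->connect('127.0.0.1', 6379);
$redis->setOption(Redis::OPT_SERIALIZER, Redis::SERIALIZER_IGBINARY);
$redis->setOption(Redis::OPT_COMPRESSION, Redis::COMPRESSION_ZSTD);

$payload = '{"uuid":"b0985078"}';

// A Lua script stores the member raw: eval arguments bypass the
// client-side serializer and compression.
$redis->eval("return redis.call('zadd', KEYS[1], 1, ARGV[1])", ['reserved', $payload], 1);

// zRem serializes and compresses the member before sending it, so it
// never matches the raw member written by the script.
var_dump($redis->zRem('reserved', $payload)); // int(0), nothing removed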

driesvints commented 1 year ago

@tillkruss nice finds. Thanks!

@andrej-schaefer are you willing to work on a PR?

andrej-schaefer commented 1 year ago

@driesvints I'm sorry, but I'm not. I'm not deep enough into Laravel to find the error quickly, and I have very limited time. What I can offer is discussion and analysis of proposals :)

BTW: do I have to create the commit for the reproduction if it is already reproduced? :) Due to my limited time, it will take a while. And keep in mind the second error regarding the DB constraint on unique jobs that fail multiple times (if you want, I can prepare a separate ticket for it). Or would you say it is not a bug and I have misunderstood the documentation for unique jobs?

driesvints commented 1 year ago

No worries. I'll keep this open then and see if anyone can help here.

github-actions[bot] commented 1 year ago

Thank you for reporting this issue!

As Laravel is an open source project, we rely on the community to help us diagnose and fix issues as it is not possible to research and fix every issue reported to us via GitHub.

If possible, please make a pull request fixing the issue you have described, along with corresponding tests. All pull requests are promptly reviewed by the Laravel team.

Thank you!

driesvints commented 1 year ago

And keep in mind the second error regarding the DB constraint on unique jobs and multiple fails (If you want i can prepare a ticket for it separate) Or you say it is not a bug and i have misunderstood the documentation for unique jobs?

Maybe that's something for the docs?

andrej-schaefer commented 1 year ago

I mean, when creating unique jobs through the ShouldBeUnique interface, does it mean the job is unique for the current execution, i.e. that multiple workers do not pick up a job with the same jobId in parallel? If so, then when the same job with the same jobId fails multiple times (e.g. first run fails, second run succeeds, third run fails again), it results in a unique constraint error. I think it is solvable by using the WithoutOverlapping middleware, since that does not affect the jobId (see the sketch below). As I said, maybe I misunderstand the ShouldBeUnique interface, but even then it is a bug (I think :) )
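
For reference, a minimal sketch of the WithoutOverlapping alternative (the middleware is Laravel's; the $refUuid key is a hypothetical job property):

use Illuminate\Queue\Middleware\WithoutOverlapping;

public function middleware(): array
{
    // Ensures only one job with this key runs at a time, without
    // changing the job's UUID the way ShouldBeUnique does.
    return [new WithoutOverlapping($this->refUuid)];
}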

Huggyduggy commented 1 year ago

@tillkruss : Would this be something relay.so, being the successor of phpredis, would/could handle differently?

Besides, I'd guess there's quite a connection to https://github.com/laravel/framework/pull/40569 & https://github.com/laravel/framework/pull/41718, which sadly weren't tackled back then.

tillkruss commented 1 year ago

@Huggyduggy: Theoretically, yes. But I think it's much better to unset any compression/serializer options on the queue Redis connection; that will solve it, and it's hardly any work. It just needs a PR.
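
A sketch of what unsetting those options on the queue connection could look like (hypothetical; not the eventual fix):

use Illuminate\Support\Facades\Redis;

// Fetch the underlying phpredis client for the queue connection and turn
// the client-side data mutations off, so zrem members match what the Lua
// scripts stored.
$client = Redis::connection('default')->client();
$client->setOption(\Redis::OPT_SERIALIZER, \Redis::SERIALIZER_NONE);
$client->setOption(\Redis::OPT_COMPRESSION, \Redis::COMPRESSION_NONE);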

Huggyduggy commented 1 year ago

Thanks @tillkruss - I've drafted a PR, fixing the same issue as reported by @andrej-schaefer. I'll submit it against the current 10.x branch tomorrow.

Sad that it prohibits using compression within queues; I guess we'll stick to manually compressing our larger objects within our event listeners & jobs prior to dispatching 😌

jameshulse commented 1 year ago

Can somebody advise whether this issue would also affect Laravel Horizon? I can't reproduce the issue with a quick test.

driesvints commented 1 year ago

Looks like Taylor noted he'd appreciate a note being added to the docs: https://github.com/laravel/framework/pull/48180#issuecomment-1699685697.