Closed — graemlourens closed this issue 10 months ago
Hi there,
I added the following tests to verify the issue and was unable to replicate the problem; jobs and failed_jobs updated properly. We need to be able to replicate the issue in order to solve the problem.
Hey there, thanks for reporting this issue.
We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here?
Please make sure that you have the latest version of the Laravel installer in order to run this command. Please also make sure you have both Git & the GitHub CLI tool properly set up.
laravel new bug-report --github="--public"
Do not amend and create a separate commit with your custom changes. After you've posted the repository, we'll try to reproduce the issue.
Thanks!
@crynobone Thank you for the time invested.
We'll as a next step try and reproduce this in a fresh laravel project to rule out the fact that this has something to do with our environment and get back to you asap.
@crynobone
We have successfully reproduced the mentioned issue in a fresh standalone laravel project. Attached is a ZIP file with a docker environment ready for testing
Execute the following commands (in the root directory):
docker-compose build
docker-compose up -d
docker-compose exec php bash
php artisan migrate
php artisan queue:work
Open a browser at http://localhost
There will be one job dispatched which times out inside a transaction and one without.
After queue:work dies (because it was killed), you can go to http://localhost:8080 to verify that there is NO job logged in the failed_jobs table.
You can run php artisan queue:work again to process the second job, and verify that this one WILL be logged in the failed_jobs table.
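The two jobs described in these steps presumably look something like the sketch below. This is only an illustration: the class names, timeout, and sleep durations are hypothetical; the key difference is whether the sleep happens inside an open database transaction.

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Support\Facades\DB;

// Hypothetical job: sleeps past its timeout while a transaction is open.
// According to the report, its failure is NOT logged to failed_jobs.
class TimeoutInsideTransactionJob implements ShouldQueue
{
    use InteractsWithQueue, Queueable;

    public $tries = 1;   // fail permanently on the first timeout
    public $timeout = 5; // seconds before the worker kills the job

    public function handle(): void
    {
        DB::beginTransaction();
        sleep(30); // exceeds $timeout while the transaction is still open
        DB::commit();
    }
}

// Hypothetical job: same sleep, but no transaction.
// Its failure IS logged to failed_jobs as expected.
class TimeoutWithoutTransactionJob implements ShouldQueue
{
    use InteractsWithQueue, Queueable;

    public $tries = 1;
    public $timeout = 5;

    public function handle(): void
    {
        sleep(30);
    }
}
```

(Two classes are shown in one file only for compactness of the sketch; in a real app each job would live in its own file.)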
Thank you for your consideration & time
Can you push the repository to github as suggested above?
@graemlourens It seems that you didn't specify https://laravel.com/docs/10.x/queues#max-job-attempts-and-timeout and I don't see how it would go to failed_jobs without the option, as it would go back to jobs and be ready for another attempt after each failure.
@crynobone We did not specify $tries, as by default it's 1 and therefore the setting was unnecessary, and we did set the $timeout on the job properly.
However, just to rule out any error on our side, we added tries=1 on the job as well as on the worker, and it had no influence, as suspected.
One Job is logged (the one without transaction) and one job is NOT logged (the one with transaction)
Both jobs are configured exactly the same, the only difference in them is that one has the sleep inside of a transaction and the other does not.
I'm not sure how we can show you this bug any better. I highly suspect that you did not execute the steps we provided to reproduce the error, or else you could confirm this behaviour by the proof of one entry in the failed_jobs table.
If you reconsider and reopen this ticket, we'll be glad to push it as a repository as you requested.
@crynobone we've had further success in this matter:
Strangely the bug is fixed if you use 'Batchable' on the Job.
The reason the job is then logged correctly is because of Illuminate\Queue\Jobs\Job. In the 'fail' method there is a check on line 195:
if ($e instanceof TimeoutExceededException &&
$commandName &&
in_array(Batchable::class, class_uses_recursive($commandName))) {
$batchRepository = $this->resolve(BatchRepository::class);
In this case you're running 'rollBack' on the $batchRepository. Even if the job is not attached to a batch, this fixes the problem.
However, I don't think it's correct for us to have to mark all jobs as batchable to resolve this issue. I still believe this should also be handled for jobs without batches.
How do you see this?
Tries doesn't have a default of 1: https://github.com/laravel/framework/blob/047edbc3a370ccf8be58f35f2e262d59210a3852/src/Illuminate/Queue/Jobs/Job.php#L278
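For reference, pinning the attempts explicitly (as the reporter describes doing later in the thread) looks roughly like this; the job name is hypothetical:

```php
<?php

namespace App\Jobs;

use Illuminate\Contracts\Queue\ShouldQueue;

class ExampleJob implements ShouldQueue
{
    // Explicitly fail after a single attempt instead of relying on a default.
    public $tries = 1;

    // Seconds the worker allows before killing the job.
    public $timeout = 5;

    public function handle(): void
    {
        // ...
    }
}
```

Alternatively, attempts can be capped on the worker itself: `php artisan queue:work --tries=1`.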
@crynobone OK. Still, as mentioned, we set tries to 1 and it still happens.
Please also refer to my temporary solution that if you use 'Batchable' on the Job, it works, without it does not.
How do you feel about this?
I cannot replicate the issue and your reproduction code doesn't show what you are claiming is true. This is why we advised sending the repository to GitHub.
@crynobone you are confirming that you downloaded our ZIP, ran the commands we provided, and did NOT see the outcome we described? I do not see how this is possible. I can provide a screen recording if you prefer.
Or are you telling me that you will only reproduce it when we have pushed it to repository on github?
We'll need more info and/or code to debug this further. Can you please create a repository with the command below, commit the code that reproduces the issue as one separate commit on the main/master branch and share the repository here?
Having everything in a single commit allows us to completely eliminate any possibility outside just the affected code, instead of having to verify every file inside the zip file.
@crynobone a coworker of mine has uploaded the repo as requested: https://github.com/GrzegorzMorgas/laravel-bug-report
We were a little confused whether to commit everything as ONE commit, or to make a base commit with the base setup and then a second commit with the changes required to reproduce the issue. The instructions are a little ambiguous to us:
"commit the code ... as one separate commit" but then "Do not amend and create a separate commit with your custom changes"
Please let us know if what we created is OK and you can reproduce it, or if you would like the repository in a different form.
@graemlourens use the command from here: https://github.com/laravel/framework/issues/49389#issuecomment-1862206409
Then commit all custom changes separately.
@driesvints @crynobone Understood. We've done as requested:
https://github.com/GrzegorzMorgas/bug-report
Execute the following commands (in the root directory of the project):
docker-compose build
docker-compose up -d
docker-compose exec php bash
cp .env.example .env
composer install
php artisan migrate
php artisan queue:work
Open a browser at http://localhost
There will be one job dispatched which times out inside a transaction and one without.
After queue:work dies (because it was killed), you can go to http://localhost:8080 to verify that there is NO job logged in the failed_jobs table.
You can run php artisan queue:work again to process the second job, and verify that this one WILL be logged in the failed_jobs table.
If you 'use Batchable' on the Job, then the one timing out inside a transaction suddenly works as well. We have traced this back to
if ($e instanceof TimeoutExceededException &&
$commandName &&
in_array(Batchable::class, class_uses_recursive($commandName))) {
$batchRepository = $this->resolve(BatchRepository::class);
In Illuminate\Queue\Jobs\Job
Because there is a rollback there, it seems to work. However, we believe it should not be necessary to mark all jobs as Batchable to get around this bug.
Thank you for your consideration & time
Thanks @graemlourens. Might be a while before we can dig into this because we're all heading into holidays.
I have reviewed this issue and it may not be an actual bug after all. This happens because of the async signal handler.
The registerTimeoutHandler method in src/Illuminate/Queue/Worker.php (line 167) will kill the current process if the timeout is reached. At this point it will try to log the failed job in the database, but since a transaction was open, the insert will not persist.
Here are two solutions:
1. Add a public function failed(\Throwable $exception): void function inside your job to roll back the transaction. I tested this and it works fine (the failed job is logged).
2. Roll back the transaction inside the registerTimeoutHandler function. I don't think this should fall under the scope of the framework though; it should be handled within the job's failed function instead (point 1).

@gkmk Thank you for your insight and suggestions.
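For illustration, the first suggestion (a failed() method that rolls back any transaction still open when the worker timed out, so the framework's insert into failed_jobs can persist) could be sketched as follows; the job name and timings are hypothetical:

```php
<?php

namespace App\Jobs;

use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

class HeavyTransactionJob implements ShouldQueue
{
    public $tries = 1;
    public $timeout = 5;

    public function handle(): void
    {
        DB::beginTransaction();
        // ... heavy work that may exceed $timeout ...
        DB::commit();
    }

    public function failed(\Throwable $exception): void
    {
        // Roll back every transaction level left open when the worker
        // killed the job, so the failed-job insert is not swallowed
        // by the still-open transaction.
        while (DB::transactionLevel() > 0) {
            DB::rollBack();
        }
    }
}
```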
The reason I would tend to shy away from option 1 is that behaviour would then be different between jobs that are batchable and ones that are not. This would be an unnecessary inconsistency in my opinion. (See my previous comment.)
On a more subjective note, we have > 100 jobs, most of them have transactions and do not require failed methods. Option 2) would add failed methods just to handle this specific case so as not to lose jobs, which seems to be quite an overhead.
Will be interesting to see what the core of laravel developers have to say.
is because then behaviour would be different between jobs that are batchable, and ones that are not. This would be an unnecessary inconsistency in my opinion.
Yes, this is intended, because Batchable jobs offer catch and finally callbacks, which need to be called within the same process, while normal queued jobs typically rely on n+1 attempts (n = number of retries) to throw the error.
would add failed methods just to handle this specific case so as not to lose jobs, which seems to be quite an overhead.
The documentation already covers this in detail:
@crynobone @driesvints @themsaid we disagree with your assessment and rest our case. We hope that in future there will be a fix for this inconsistency which will reduce the chance of losing jobs, which we find crucial to be able to rely on Laravel queues as a major component of larger applications.
for anybody else worried about this: Our solution will most likely be to declare all Jobs as Batchable, which seems to have no ill effects; even if the job is never dispatched within a batch, the transaction is then handled correctly, so the job is logged properly in case of a timeout within a transaction.
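The Batchable workaround described above would look roughly like this (job name hypothetical). Per the earlier analysis of Illuminate\Queue\Jobs\Job::fail(), the mere presence of the trait triggers the rollback branch on TimeoutExceededException, even when the job is not part of a batch:

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Batchable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Support\Facades\DB;

class HeavyTransactionJob implements ShouldQueue
{
    // The trait makes Job::fail() roll back via the BatchRepository
    // on TimeoutExceededException, which closes the open transaction.
    use Batchable;

    public $tries = 1;
    public $timeout = 5;

    public function handle(): void
    {
        DB::transaction(function () {
            // ... heavy work that may exceed $timeout ...
        });
    }
}
```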
for anybody else worried about this: Our solution will most likely be to declare all Jobs as Batchable,
Read the documentation: the worker can be restarted using a process manager (which should be the default in any production application anyway), or use queue:listen (which uses more CPU, if you insist on not going with a process manager).
@crynobone we use Laravel Horizon, which handles that automatically. But please be aware: if the job had $tries=1 (which most of our jobs do), the job is NOT retried, and NOT logged, and therefore lost.
Please correct me if I'm wrong. This is the whole point of this ticket: losing jobs.
Laravel Version
10.37.3
PHP Version
8.2.7
Database Driver & Version
MariaDB 10.3.27
Description
We had a 'phantom' problem where jobs that failed were not logged to the failed_jobs table. After some investigation we found that by mistake we were doing a heavy process within a db transaction, which led the specific job to time out.
After further testing we found that the issue only happens if the job times out during an open db transaction.
I reported this in laravel/horizon but it turns out it has nothing to do with horizon but seems to be a framework issue.
Steps To Reproduce
We would expect a job in the failed_jobs table; however, this is not the case, and the JobFailed event is also not dispatched.
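One way to observe the reported behaviour is to listen for the JobFailed event mentioned above; in the transaction case, neither the failed_jobs row nor this event appears. A sketch of such a listener (registered, for example, in a service provider's boot method):

```php
<?php

use Illuminate\Queue\Events\JobFailed;
use Illuminate\Support\Facades\Event;
use Illuminate\Support\Facades\Log;

// E.g. in AppServiceProvider::boot(): log every failed job the worker reports.
// In the bug scenario described here, this listener never fires for the
// job that times out inside an open transaction.
Event::listen(JobFailed::class, function (JobFailed $event) {
    Log::warning('Job failed', [
        'connection' => $event->connectionName,
        'job' => $event->job->resolveName(),
        'exception' => $event->exception->getMessage(),
    ]);
});
```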