laravel / horizon

Dashboard and code-driven configuration for Laravel queues.
https://laravel.com/docs/horizon
MIT License

Jobs marked as failed even if they are running correctly #128

Closed crash13override closed 6 years ago

crash13override commented 7 years ago

I have a few queue jobs that need some time to run because they are encoding videos. They typically last a couple of minutes.

When I fire the job, it runs correctly and I can see it being executed in the "Recent Jobs" section of Horizon. I can see it running, then it gets a red cross icon as if it failed, and then immediately afterwards it turns green, marking it as completed. The problem is that I can also see it in the "Failed" section, where it says the job failed with an Illuminate\Queue\MaxAttemptsExceededException.

At the moment I have 2 processes for the queue. If I set it to just 1 process, the problem no longer happens.

I also tried setting the "timeout" property for the job to 1800 seconds, but it seems to be ignored.
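
(For reference, I'm setting it roughly like this — EncodeVideo is just a placeholder name for my job class:)

```php
<?php

namespace App\Jobs;

use Illuminate\Bus\Queueable;
use Illuminate\Contracts\Queue\ShouldQueue;
use Illuminate\Queue\InteractsWithQueue;

class EncodeVideo implements ShouldQueue
{
    use InteractsWithQueue, Queueable;

    // Seconds the worker should allow this job to run before timing it out.
    public $timeout = 1800;

    public function handle()
    {
        // ...long-running video encoding here...
    }
}
```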

Could it be that, with 2 processes, the second one tries to run the job even while the first one is still running it? Is there anything I need to consider for especially long jobs? I have another queue with 10 processes running small tasks, and that one isn't giving any problems at all.

Thanks a lot for your help, and congrats on the amazing library!!!

themsaid commented 7 years ago

So the job is running, then marked as failed, then marked as completed?

crash13override commented 7 years ago

Yes, exactly! And the weird thing is that after a few minutes they are removed from the failed jobs list (not sure if this happens when I deploy an update and it runs horizon:terminate).

crash13override commented 7 years ago

After some further testing, I noticed that the list of failed jobs was emptied after it ran the same queue job again, this time without errors because the file was very small.

After that, for some mysterious reason, the encoding jobs are no longer listed under the "Failed Jobs" section, but I can still see the icon of the job in the "Recent Jobs" section turning red with a cross and then finally green.

Since all the code is running just fine, it's no big deal as long as it keeps my failed jobs list empty. I'll let you know if I manage to reproduce the error in a more precise way.

I'm attaching a few screenshots showing the job go from yellow to red and then finally green.

crash13override commented 7 years ago

Another weird thing is happening with the failed jobs. Previously I said that the list of failed jobs was emptied, but actually it's not: I still receive the failed jobs via AJAX, but for some reason the frontend doesn't show them. Here's a screenshot showing how the list appears empty even though I receive a JSON full of failed jobs.


My setup is the following:

Server A: MySQL and Redis DB server

Server B: Webserver

Server C: Encoding server

Server B and Server C are both running "php artisan horizon", each with its own queue (default for the webserver and encode for the encoding server). I run "php artisan horizon:snapshot" just on the webserver.

So the webserver dispatches jobs onto both queues (default and encode) but listens just on the default one. The encoding server listens only on the encode queue.
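
For reference, the relevant bit of my config/horizon.php looks roughly like this on each box; the queue name is the only difference between the two servers (supervisor name and values simplified):

```php
// config/horizon.php (excerpt) — Server C uses 'encode', Server B uses 'default'
'environments' => [
    'production' => [
        'supervisor-1' => [
            'connection' => 'redis',
            'queue' => ['encode'], // ['default'] on the webserver
            'processes' => 2,
        ],
    ],
],
```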

Am I missing something somewhere? Should my setup be different? Or could it just be a bug in Horizon?

Thanks again for your help!!!

crash13override commented 7 years ago

I'm still experiencing the problem with the missing failed jobs. I tried to debug it quickly but haven't managed to yet. Does this happen to other people as well, or is it just me?

alexardalich commented 7 years ago

I'm not quite seeing that.

I have a report job which takes a few minutes to run (it does some API calls, DB queries, and mails an Excel file); after a little while the job is marked failed with a MaxAttemptsExceededException and stays failed.

Yet after some time, the report pops into my Inbox.

I have moved the reporting queue back to the database driver.

alexardalich commented 7 years ago

Actually no, after walking away from it for a bit, it has done exactly as you've described.

A job that was marked failed became green (the earlier one in my screenshots). The failed jobs page became empty, but the dashboard reported that I had a failed job. Then the second attempt failed, but it didn't populate into the failed jobs page when I clicked on the error within recent jobs.

After about 5-10 minutes, that second one changed to green as well. The dashboard still flags the failed ones that have since turned green, while the failed jobs page remains empty. And the job did work; I have the report in my Inbox.

dennisoderwald commented 7 years ago

We have exactly the same problem. Have you found a solution?

cytRasch commented 7 years ago

We also have the same problem with all of our long-running jobs.

vesper8 commented 7 years ago

I'm also having a similar problem... it might be the same one. I have a lot of jobs that fail with "A queued job has been attempted too many times or run too long. The job may have previously timed out."

Increasing the tries or the timeout has done nothing to help. And the jobs are succeeding, i.e. no other errors are thrown that I can see.

I haven't found a helpful way to get more specific output on what's causing the jobs to fail.

alexardalich commented 7 years ago

I think it may not be Horizon-related. I moved the job to the database driver and see the same behaviour, whereas I have very long-running jobs on the database driver on a 5.1 project which are fine.

I may try moving those 5.5 long-running jobs to the 5.1 project.

doomtickle commented 7 years ago

@vesper8 Same issue here. The only difference is that this works fine in my Valet dev environment, but jobs fail every time when deployed to Forge.

crash13override commented 7 years ago

I'm not 100% sure, but I think it could be related to this issue in my case:

#205

After setting a higher value (1800) for retry_after under connections.redis in config/queue.php, everything seems to be running OK and it doesn't run the jobs twice anymore.
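
To be concrete, the change is in config/queue.php, roughly like this (illustrative excerpt):

```php
// config/queue.php
'connections' => [
    'redis' => [
        'driver' => 'redis',
        'connection' => 'default',
        'queue' => 'default',
        // Seconds before a reserved (in-progress) job is handed back to
        // another worker. Must exceed the slowest job, or it runs twice.
        'retry_after' => 1800,
    ],
],
```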

Still testing, but so far so good...

I think the name 'retry_after' is a bit misleading, because if it's like @denaje says in his last message, then it behaves more like a secondary "timeout" instead. So it's a bit like overriding the "timeout" value in the Horizon config file.

Or am I getting this wrong perhaps, @themsaid?

Please give it a try and let me know if it fixes the issue for you as well.

Thanks!

alexjjassuncao commented 7 years ago

@crash13override I'm testing 1800 on retry_after too. Just so I know, what timeout have you set in your config/horizon.php?

crash13override commented 7 years ago

I've set 1800 as well.
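
For clarity, that's the timeout in the supervisor options of config/horizon.php, roughly (excerpt, my values):

```php
// config/horizon.php — supervisor options (excerpt)
'supervisor-1' => [
    // ...
    // Seconds a worker may spend on a job. The docs say retry_after
    // should be kept larger than this value.
    'timeout' => 1800,
],
```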

alexardalich commented 7 years ago

Ah, the penny drops.

I've created another connection in the queue config with a longer retry_after to use for my reports queue.
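
Roughly like this — the connection name and job class are placeholders:

```php
// config/queue.php — a second connection just for the slow reports queue
'redis-reports' => [
    'driver' => 'redis',
    'connection' => 'default',
    'queue' => 'reports',
    'retry_after' => 3600, // comfortably longer than the slowest report
],

// ...and the report job gets dispatched onto it:
// GenerateReport::dispatch()->onConnection('redis-reports');
```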

Thank you

pdbreen commented 6 years ago

Exactly the problem I'm having, but I couldn't figure out where to set the retry_after value. Now I know, and I'll give this a shot!

Thanks!

lukepolo commented 6 years ago

I'm also having this issue.

sisve commented 6 years ago

Several people with similar problems are running Windows, which does not support timeouts. Some screenshots in this issue show Unix paths, so that's not the issue here. However, what version of PHP are you using? Timeouts require PHP 7.1 or newer. Can you confirm that you've met this requirement, or are you using PHP 7.0?

sisve commented 6 years ago

I notice now that I've jumped repositories: while laravel/framework will accept PHP 7.0 (in its composer.json), Horizon will not.

crash13override commented 6 years ago

I can confirm that I have PHP 7.1, so the issue isn't related to the PHP version; it was the "retry_after" value after all.

themsaid commented 6 years ago

Please, everyone, make sure the retry_after value is greater than the time it takes a job to run; this is already mentioned in the queue documentation.

zlanich commented 6 years ago

I'm still having this same issue on Laravel 5.7+, PHP 7.1+. I have jobs that run ~6 min with a $timeout set to 30 min. I'm not using Horizon; I'm using SQS + Heroku + Supervisor. The job falls into my failed_jobs table, but I get an email that my report was generated successfully right around that time. It seems to think the job exceeded the timeout or max tries, but the jobs finished without exceptions.

Is there another issue like this open right now? Can this one be re-opened if necessary?

sisve commented 6 years ago

@zlanich

  1. You're not using Horizon, and yet you're writing in the laravel/horizon repository?
  2. The last post tells everyone to check their retry_after value, but you don't mention yours. What is your retry_after time, and is it higher than the timeout value?

zlanich commented 6 years ago

@sisve I apologize for posting in a laravel/horizon thread, but this was the only thread I could find on the internet where someone else was having this same issue. After looking at the retry_after value, I recall that SQS does not support retry_after, so I'm not sure what to do here.

It does not make any sense to me why Laravel/SQS would allow or attempt a retry while the job is still running. I'm not sure how you would handle long-running jobs with Laravel/SQS under these circumstances.

If anyone can help, I'd hugely appreciate it, as this application runs our city's entire mobile parking infrastructure! Also, if anyone knows of a non-Horizon thread that I missed where someone else had this same issue, please let me know!

Thanks again.

JeremyHargis commented 6 years ago

@sisve

> Several people with similar problems are running Windows, which does not support timeouts.

Does this only apply to Horizon, or to any job queues in standalone Laravel too? I looked at the Laravel queue documentation and I don't see anything about this limitation on Windows, but I am seeing a related problem on my Windows server.

Thanks.

zlanich commented 6 years ago

I was able to adjust my Amazon SQS visibility timeout to fix this issue, since the retry_after option is not supported for SQS. This isn't ideal, but it did solve my issue (for all intents and purposes). I feel like Laravel core should do some sort of coalescing of retry_after and timeout so it doesn't do funky stuff like this. Am I crazy?
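
For anyone else on SQS: the visibility timeout is an attribute of the queue on AWS's side. It can be changed in the AWS console, or with the AWS SDK for PHP along these lines (the queue URL is a placeholder):

```php
use Aws\Sqs\SqsClient;

$sqs = new SqsClient([
    'region'  => 'us-east-1',
    'version' => 'latest',
]);

// While a worker holds a message, it stays invisible to other workers for
// this many seconds — it plays the same role as retry_after on Redis,
// so it must exceed the longest job's runtime.
$sqs->setQueueAttributes([
    'QueueUrl'   => 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue',
    'Attributes' => ['VisibilityTimeout' => '1800'],
]);
```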

sisve commented 6 years ago

@JeremyHargis Timeouts require the pcntl extension for PHP. This extension isn't available on Windows. (This also implies that timeouts will not work on a *nix system that runs PHP without pcntl.)

This applies to Laravel's queue system and isn't Horizon specific.

https://github.com/laravel/framework/blob/951a12fb2e1539c84a30172caf5fca33d72a1bec/src/Illuminate/Queue/Worker.php#L110-L112

https://github.com/laravel/framework/blob/951a12fb2e1539c84a30172caf5fca33d72a1bec/src/Illuminate/Queue/Worker.php#L536-L539
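
Paraphrasing what those linked lines amount to (a sketch, not the framework's exact code):

```php
// Example value standing in for the job's $timeout.
$timeoutInSeconds = 60;

// The worker only arms a job timeout when pcntl is available:
if (extension_loaded('pcntl')) {
    pcntl_async_signals(true);

    pcntl_signal(SIGALRM, function () {
        // The alarm fired: the worker kills itself and Supervisor/Horizon
        // restarts it — this is how $timeout is actually enforced.
        exit(1);
    });

    pcntl_alarm($timeoutInSeconds);
}
// Without pcntl (e.g. on Windows), none of this runs, so $timeout is ignored.
```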

vedmant commented 5 years ago

Yeah, I spent an hour trying to debug why my job was failing until I found this issue. I think this has to be highlighted in the documentation next to each mention of timeout.

jur3 commented 4 years ago

I had the same issue with a too-low retry_after. Thank you for saving my life... beer's on me ;)

themsaid commented 4 years ago

It's there :)

https://laravel.com/docs/6.x/queues#job-expirations-and-timeouts

crash13override commented 4 years ago

Thanks a lot @themsaid ,

It's much clearer now in the docs, and it will surely help a lot of people avoid running into the same problem!

Thanks again, and keep up the awesome work you've been doing on Laravel!

luciantugui commented 2 years ago

Even though timeout is smaller than retry_after, once a job fails, subsequent jobs are immediately failed after a very short lifespan (0 seconds). I use Horizon with Supervisor. Basically it is an endless loop: all new jobs, scheduled at the regular interval, are failed right away with the message "has been attempted too many times or run too long. The job may have previously timed out."

Manually triggered jobs run properly.

anstapol commented 1 year ago

Same here; the solution from the docs (making retry_after greater than timeout) didn't help.

No exception is thrown, and when I manually run the job through Tinker it's lightning fast and successful.

AidasK commented 1 year ago

It would have saved us tons of time if Laravel had added an assertion that retry_after >= timeout. Is there a use case where timeout can be higher than retry_after?

alfonsogarza commented 5 months ago

> It would have saved us tons of time if Laravel had added an assertion that retry_after >= timeout. Is there a use case where timeout can be higher than retry_after?

Were you able to resolve this @anstapol?

anstapol commented 5 months ago

> Were you able to resolve this @anstapol?

In our case there were tons of jobs reading and writing the same table, so when I executed jobs manually they always worked, but when a big batch entered the queue it still failed. In the end we decided to remove the moderation code which was checking records before insert and simply use upsert in combination with unique key constraints. That removes the need for each job to check data validity. It's still not ideal for us, because we sacrificed a small feature, but in case you have the same issue I don't think there's much you can do.
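
If it helps anyone, the shape of the change was roughly this, assuming Laravel 8+ where the query builder has upsert(); table and column names are made up:

```php
use Illuminate\Support\Facades\DB;

// Before: each job SELECTed to check whether the record already existed,
// then INSERTed — which fell over under heavy concurrency.
// After: a unique index on external_id plus a single upsert per row.
DB::table('records')->upsert(
    [
        ['external_id' => $externalId, 'payload' => $payload],
    ],
    ['external_id'], // conflict target — must be covered by a unique index
    ['payload']      // columns to update when the row already exists
);
```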