Laravel 5.3 Job fails without calling failed() method #15192

Closed jasonmccreary closed 8 years ago

jasonmccreary commented 8 years ago

In Laravel 5.3, a Job will silently fail (no method invocation and nothing in the Laravel log) if you have not updated the job's failed() method to receive an Exception.

For example, the following code is acceptable in Laravel 5.2, but will not be called in Laravel 5.3.

public function failed()
{
    // handle failure
}

It needs to be updated to:

public function failed(Exception $exception)
{
    // handle failure
}

While I would not consider the update a bug, it was rather time-consuming to track down since no error was reported. After determining the cause, part of me thinks the unexpected method signature should throw an exception.

If nothing else, this was not listed in the Upgrade Guide, so I'm opening this issue for future readers.

GrahamCampbell commented 8 years ago

PHP doesn't work like that. Adding another parameter won't cause it to suddenly start getting called. PHP would still call it, just with too many arguments, which is the same as calling it with no parameters for all intents and purposes.
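
(To illustrate the point: in plain PHP, passing an argument to a method declared without parameters still invokes the method; the extra argument is simply ignored, though it remains reachable via func_get_args(). A quick standalone sketch:)

class ExampleJob
{
    public function failed()
    {
        // Runs even when called with an argument.
        echo 'failed() called with '.func_num_args()." argument(s)\n";
    }
}

(new ExampleJob)->failed(new Exception('boom')); // failed() called with 1 argument(s)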

jasonmccreary commented 8 years ago

Normally I would agree, but from what I can tell, this alone resolved the issue. I'll attempt to recreate with some sample code.

jasonmccreary commented 8 years ago

Could not reproduce with a sync queue driver running Laravel 5.3.6. Will try with the database queue driver as that's where I first experienced the issue.

I understand this is unlikely to be the issue. But now I'm curious what I changed that resolved my original issue.

jasonmccreary commented 8 years ago

Okay, so I think this is related to retry_after. While it's true the failed() method is not being called, it's because the job is still running. It runs so long that it triggers a retry, which starts another job, then another, then another, then another...

My job queue did not behave this way when running 5.2. So I still think there's something wonky here. I'll keep digging. But for now, I'll just set my retry_after higher.
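
(For anyone else looking for it: retry_after is set per connection in config/queue.php. For the database driver the block looks roughly like this, where 90 is the shipped default and the value just needs to exceed the runtime of your longest job:)

'database' => [
    'driver' => 'database',
    'table' => 'jobs',
    'queue' => 'default',
    'retry_after' => 90, // raise this above your longest-running job
],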

jesseleite commented 8 years ago

We're doing an upgrade to 5.3 right now and coming across some weirdness around failed jobs with the database queue driver.

If I have an email job sitting in jobs table, and throw an intentional exception, then run php artisan queue:listen --tries=3, I expect the job to fail 3 times then move to the failed_jobs table (at least that's how it worked for us in 5.2). However in 5.3, the job disappears off both tables with no valuable output by the listener, and no email sent.
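
(The job under test is nothing special; a minimal sketch, with an illustrative class name, is just a queued job whose handle() throws:)

use Exception;
use Illuminate\Bus\Queueable;
use Illuminate\Queue\SerializesModels;
use Illuminate\Queue\InteractsWithQueue;
use Illuminate\Contracts\Queue\ShouldQueue;

class SendWelcomeEmail implements ShouldQueue
{
    use InteractsWithQueue, Queueable, SerializesModels;

    public function handle()
    {
        // Intentional exception to exercise the failed-job path.
        throw new Exception('Intentional failure for testing');
    }
}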

So confused! Not sure if our problems are related, but it seems something is funky with failed job handling and database queues right now.

jasonmccreary commented 8 years ago

@JesseLeite that sounds exactly like the issue I was having at first as well. Be sure to restart your queue after upgrading; I think that was one of the things that added to the weirdness.

Nonetheless, I still think there's something here. I planned to look at this closer over the weekend to provide some code samples to reproduce the issue between 5.2 and 5.3.

bobbybouwmann commented 8 years ago

Currently my job is failing because of the issue @jasonmccreary raised in the first place:

ErrorException: Missing argument 1 for Illuminate\Queue\Jobs\Job::failed(), called in 
/vagrant/vendor/laravel/framework/src/Illuminate/Queue/InteractsWithQueue.php on line 47 
and defined in /vagrant/vendor/laravel/framework/src/Illuminate/Queue/Jobs/Job.php:143

I can find this in my failed_jobs table in the exception column.

After making sure that the failed method gets the correct exception as an argument (see below), I see my jobs disappearing after failing, so they can't be found in the jobs or the failed_jobs table. There is also no exception in the log.


@GrahamCampbell The failed method in InteractsWithQueue does not currently match the interface of the given Job.

public function failed()
{
    if ($this->job) {
        // This needs an Exception argument according to its interface.
        return $this->job->failed();
    }
}
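
(For illustration only, and not the actual framework patch, which as noted further down renamed the trait method to fail(): one way the trait could satisfy the Job's failed(Exception $e) contract would be to accept and forward the exception.)

public function failed(Exception $e = null)
{
    if ($this->job) {
        // Hypothetical adjustment: forward the exception so the call
        // matches the Job's expected signature.
        return $this->job->failed($e);
    }
}
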
jesseleite commented 8 years ago

@jasonmccreary sorry for the delayed reply; I was camping all week ⛺

Be sure to restart your queue after upgrading

As mentioned in my original post, we're running into this problem when manually running php artisan queue:listen --tries=3 locally, so it's not like we had any processes/daemons to restart.

I planned to look at this closer over the weekend to provide some code samples to reproduce the issue between 5.2 and 5.3.

Curious if you had a chance to dig in?

jesseleite commented 8 years ago

@crynobone and I were chatting about this commit he made a while back to InteractsWithQueue. He suggested the failed() method be removed from InteractsWithQueue altogether, since at the time of his commit, the failed() method had a different intention than the current implementation in 5.3. I definitely agree with his suggestion.

That said, we're still having problems with jobs not being released to either the jobs or failed_jobs tables (the jobs just disappear when PHP dies). @Kcharle and I feel like we've narrowed it down to the unserialize() call in CallQueuedHandler on line 81; however, we cannot consistently reproduce it, as it seems to depend on where the exception is being thrown from. Also not sure if it's directly related to the database queue driver, or if it affects all queue drivers. Bashing our heads here!

jasonmccreary commented 8 years ago

@JesseLeite sorry for the delay. I have not. Just as you have, I spent a few hours trying to determine the root cause before moving on.

I definitely think there is something here. Whether it's the config, PHP 7, the daemon, the driver, or some combination therein, I don't know. But something's up. Hopefully we'll get some more participation on this issue.

themsaid commented 8 years ago

Check this commit: https://github.com/laravel/framework/commit/e1d60e0fe120a7898527fb997aa2fb9de263190c

The method was renamed to fail() to avoid the conflict, maybe the answer lies there.

jesseleite commented 8 years ago

@themsaid, thank you, I saw @taylorotwell's reply and commit. However, even if we add our own failed() handling directly on our job, the issue still remains. Taylor's commit removes the method name conflict, but that's a different issue that's not affecting us right now.

I'm not 100% sure, but I believe it might be something that changed in this commit, which was a pretty big overhaul of queues.

themsaid commented 8 years ago

@JesseLeite Not sure if your issue is the same as @jasonmccreary's, I'm kind of confused here :) I feel like they are two different issues, no?

jesseleite commented 8 years ago

@themsaid possibly. I feel like we're both pointing at different possible causes, but neither of us is sure what's really going on. @jasonmccreary, correct me if I'm wrong, but what I believe we both have in common is this:

  1. A job fails silently (no exception output or "FAILED" output when running queue:listen).
  2. That job is not released back to either the jobs or failed_jobs table (when PHP dies, we lose that job forever).

UPDATE: It seems the exception is visible in my log file, but as mentioned there's no valuable output from queue:listen, and the job is not returned to the jobs or failed_jobs table.

jasonmccreary commented 8 years ago

Yes, @themsaid my original issue seems to be the same as what @JesseLeite is now describing. I backed off as I did not have time to reproduce the error.

Since then, I have noticed something else unrelated to this specific issue but related to the queue behavior: the combination of the tries vs retry_after configuration noted in my previous comment.

themsaid commented 8 years ago

@JesseLeite I've tried to replicate the issue using the database driver: I throw an exception in the handle() method of the Job and then start the queue using:

php artisan queue:listen --tries=3

The Job runs 3 times and then gets removed from the jobs table and a new record is added to the failed_jobs table.

Are you experiencing the issue with the database driver too? @jasonmccreary couldn't replicate using the sync driver and I'm unable to replicate using the DB driver.

erikmoren commented 8 years ago

I had the same problem that the failed method was not called, and the job disappeared and was not moved to the failed_jobs table.

I changed my job's failed method to also accept the event, like this: public function failed($event, Exception $e)

and now it works.

According to this code

    public function failed($e)
    {
        $payload = $this->payload();
        list($class, $method) = $this->parseJob($payload['job']);
        $this->instance = $this->resolve($class);
        if (method_exists($this->instance, 'failed')) {
            $this->instance->failed($payload['data'], $e);
        }
    }

in this file https://github.com/illuminate/queue/blob/5.3/Jobs/Job.php it looks like it tries to send the exception as the second argument, no ..?
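
(Based on that call site, a job-level handler matching it would look something like the sketch below; the argument names are illustrative.)

    public function failed($event, Exception $e)
    {
        // First argument is the payload data (named $event here),
        // second is the exception that caused the failure.
        // handle the failure, e.g. log it or notify someone
    }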

Sorry if I am way off here, I am quite new to Laravel.

philbates35 commented 8 years ago

Hi, we came across an issue this morning that sounds like it's related to other posts here, after upgrading to 5.3 yesterday (specifically 5.3.9 and the database queue driver).

We had multiple jobs that hit the default timeout of 60 seconds overnight, and the Illuminate\Queue\TimeoutException: Queue child process timed out after 60 seconds. exception was logged as expected. However, the jobs that hit the timeout have disappeared from both the jobs table and the failed_jobs table, so we're not able to see the details of the jobs that failed. The only reason we're aware that the jobs failed is because the TimeoutException was logged.

Supervisor was set up to run php artisan queue:work --daemon --memory=2048 --tries=1 so it was using the default timeout of 60.

After digging through the code and trying to replicate locally, I think the expected behaviour would be that after the TimeoutException is thrown, the job would be retried, which would result in a MaxAttemptsExceededException being thrown (as we have --tries=1), which would also be logged and the job moved over to the failed_jobs table with the MaxAttemptsExceededException and its stacktrace stored in the exception column. Strangely, I was unable to reproduce this locally - the job that timed out was moved to the failed_jobs table with the MaxAttemptsExceededException as expected. But in production, that didn't happen.

The jobs in question don't have a failed() method on them (don't know if that is relevant).
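
(As an aside, for jobs that legitimately need more than 60 seconds, the worker timeout can be raised explicitly; the value below is just an example:)

php artisan queue:work --daemon --memory=2048 --tries=1 --timeout=300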

Let me know if you want any more information.

themsaid commented 8 years ago

@philbates35 I've tried to replicate locally too with no luck; that's one tricky issue to trace.

jasonmccreary commented 8 years ago

@philbates35's issue is more in line with what I'm now experiencing. However, I'm still curious what the original issue was.

I agree with @GrahamCampbell that a missing parameter shouldn't matter given how PHP normally behaves, but I'm curious whether, with PHP 7's new Error exception, certain error reporting configurations could cause this behavior.

Anyway, I'm curious about the stack for people having this issue. Maybe there's a clue there. I was running Laravel 5.3.2, PHP 7.0.10, and Supervisor was managing my job queue.

philbates35 commented 8 years ago

Here's our stack:

Laravel 5.3.9
Database queue driver
Work command: php artisan queue:work --daemon --memory=2048 --tries=1
CentOS 6.8
PHP 5.6.23 (IUS yum repository)
MySQL 5.6.31 (IUS yum repository)
Supervisor 2.1 (EPEL yum repository)

jesseleite commented 8 years ago

@JesseLeite I've tried to replicate the issue using the database driver [...]

@themsaid Yesterday I created a fresh 5.3 project and tried to reproduce there as well, but haven't been able to. I would think it has something to do with my code, but then it seems quite a few people are having issues with queues lately as well. I'm just not sure where to look. It worked in 5.2, and nothing has changed on this 5.3 upgrade branch apart from upgrade guide related commits. I'll keep playing around to see if I can reproduce it in a fresh 5.3 app.

Anyway, I'm curious about the stack for people having this issue.

Laravel 5.3.9, upgraded to 5.3.10
Database queue driver
Work command: php artisan queue:listen --tries=1
Server stack: Homestead 0.4.2 (which has PHP 7.0.3, MySQL 14.14, etc.)

I'm skeptical of it being a stack issue, since everything worked in the same environment with 5.2. But who knows, I'm lost!

themsaid commented 8 years ago

Are you all using the database driver? Or is it failing on different drivers?

philbates35 commented 8 years ago

@themsaid reading through this issue thread again, it seems everyone that has reported issues here was using the database driver (references to jobs and failed_jobs tables, or explicitly stating they are using the database driver).

philbates35 commented 8 years ago

That doesn't rule out the issue occurring with other drivers too, of course; it's just that no one has reported it with the other drivers.

themsaid commented 8 years ago

Yeah, that's what I suspect. I'm wondering if maybe some SQL error happens in the middle and for some reason doesn't get logged.

themsaid commented 8 years ago

Was speaking with Taylor yesterday about this issue: if it only exists in the database driver in a production environment, then it shouldn't be considered an issue, since using the database driver in production is not recommended.

Do any of you have the same issue while using any driver other than the database one?

jasonmccreary commented 8 years ago

So that raises two questions:

  1. Is this documented?
  2. Was this not recommended starting in 5.3?

themsaid commented 8 years ago

  1. not afaik
  2. it's not recommended in general

I agree that this needs better documentation, but first we need to make sure the problem is only using the database queue driver.

jasonmccreary commented 8 years ago

Honestly this is a bit frustrating. I've unknowingly been using Queues with the database driver since Laravel 5.1. It's worked fine until 5.3. Now, from my perspective, it doesn't work in 5.3, so I open an issue and the answer is basically - you shouldn't have been using that.

Fine, but let's at least document that if there are no plans to make the database driver production-ready. I doubt I'm the only person using the database driver in production. Clearly, there are at least 4 of us 😉

jesseleite commented 8 years ago

I have tried to reproduce, but have spent hours and can't nail down an exact situation where it happens, and where it doesn't happen. Not even sure it's JUST the database driver, but it could be.

since using the database driver in production is not recommended

100% agree with @jasonmccreary. If the database is worthy of storing production data, dealing with password reset tokens, etc., why should it not be worthy of queue job handling? I don't think shunning the database driver is the solution.

felorhik commented 8 years ago

The described issue seems to happen in the sqs driver as well.

The issue I see is that it does not log to the failed_jobs table if it hits a timeout. The timeout is listed as 60 seconds, as @philbates35 said, except I see the same issue on the sqs driver and not the database driver.

I have been using 0 or "unlimited" timeout with both daemon and non-daemon workers, and both cause the issue. Also, SQS handles retry_after a little differently: it's set up in SQS itself as the visibility timeout, and that is currently set to 15 minutes. So in other words there is nowhere that states a timeout of 60 seconds, at least in my code (I tried to look into the source, but it's hard to grasp quickly; I will keep looking).

The jobs also vanish if they time out (although above it was described as happening when PHP restarts/fails), as described by many using the database driver. So based on my findings, I would say it can be confirmed that this problem is an issue in both the sqs driver and the database driver.

EDIT: After looking a bit more into the logs (Bugsnag), I may have confused it as happening with both daemon and non-daemon workers. The logs from the staging and production environments both go there, so when I saw recent reports of it, I mistook them as having happened in production. At least the timeout portion of this issue is likely related to daemon queue workers. As for the failed_jobs / vanishing jobs portion, I still see it in both.

EDIT 2: I noticed that the default timeout is sometimes used even if the timeout is set to 0. Since the expectation is an unlimited timeout and it's actually using the default, this can lead to misunderstandings. I have not been able to figure out exactly why the default is used; outputting the timeout seems to show that it is properly read as zero, yet every so often a timeout exception happens even when I have only a single worker. The timeout does not trigger a failed job either, although in some cases markJobAsFailedIfAlreadyExceedsMaxAttempts does trigger it correctly (it happens after the timeout). It is fairly inconsistent, and I think it is because the timeout happens after SQS makes the job visible again, so a timeout slightly smaller than the SQS visibility timeout could likely prevent that from happening (guessing). However, that is the only way a timeout will trigger a failed job.

The source code shows that the timeout only logs an exception and does not mark the job as failed. If this were changed, it could likely fix things, assuming the issue is jobs timing out and just vanishing.

Also noticed this happening with the redis driver as well.

Sorry for being long-winded. I have just fought with this for a couple of weeks now, so I want to note all my findings.

taylorotwell commented 8 years ago

Can someone briefly summarize the issue in like 1 sentence?

"The failed() method is never called for jobs which hit the timeout limit N number of tries"

Is that correct? It's hard to parse out what the issue is in all these replies.

felorhik commented 8 years ago

@taylorotwell I think there are various issues that all seem similar, as the result is lost jobs that don't end up in the failed job queue.

Anyway, in one sentence mine would be: "Using no timeout (0) results in the default timeout (60) being used sometimes." (I don't know the exact trigger for it, or whether it's every time or just sometimes.)

A job that times out does not trigger the failed function, so it took a lot of digging to figure this out. I temporarily solved it by using a number other than 0 for the timeout, matching it up with the SQS visibility timeout; now I at least get markJobAsFailedIfAlreadyExceedsMaxAttempts triggered.

I use SQS, with a 15 min visibility timeout, and I have jobs that take over 60 seconds to complete (though I have 0 as the timeout). That should be enough to replicate at least my version of the issue.
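
(In other words, something along these lines; the values are illustrative, the point being to keep the worker --timeout a bit below the SQS visibility timeout:)

php artisan queue:work sqs --tries=3 --timeout=870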

jasonmccreary commented 8 years ago

Sorry, this does seem to have become a catch-all for queue issues.

I have opened a tries vs retry_after issue (https://github.com/laravel/framework/issues/15696). While it might be related, I consider that a separate issue from what I originally opened here.

For me, this issue was - failed jobs are silently lost to the ether

taylorotwell commented 8 years ago

Yeah this thread has basically become an unusable mass of confusion with no clear steps on what to do to reproduce anything.

jasonmccreary commented 8 years ago

Fair. However, I think there's something here. I had tried to come up with reproduction steps, but my focus became getting my app working again.

I'll try to reinvestigate early next week. Feel free to tag this accordingly to relay it's on hold.

jesseleite commented 8 years ago

The issue I see is that it does not log to the failed_jobs table if it hits a timeout. [...] "The failed() method is never called for jobs which hit the timeout limit N number of tries" [...] Is that correct?

Just for the record, if I manually set the timeout high, my failed job is still silently lost to the ether.

php artisan queue:listen --tries=1 --timeout=10000

Yeah this thread has basically become an unusable mass of confusion [...] Can someone briefly summarize the issue in like 1 sentence?

Since upgrading to 5.3, jobs with errors are sometimes failing silently and are not being sent back to the jobs or failed_jobs table (lost forever).

Whether because of database driver, retry_after, timeout, etc., and whether these issues are related, I think we're all equally confused. @taylorotwell I apologize for the confusion, but I think if any of us could clearly replicate we create a proper PR instead of an issue.

taylorotwell commented 8 years ago

Even if someone could give me a list of, say, 5 steps to reproduce the issue on a fresh Laravel install, that would be helpful.

felorhik commented 8 years ago

https://github.com/jordanramstad/queue-validator

I have made a simple package for my own use to help mitigate any damage from jobs vanishing. This is no fix, but at the very least it might help to diagnose/reproduce issues related to queues.

There are no unit tests or anything, and it's just kind of pulled out of my application, so it might not work in all environments, but I tried to add a clear readme anyway if anyone wants to try it out.

I don't claim this is the best way to do it, just the way I am currently leaning; feel free to open issues there to let me know what you think.

medilek commented 8 years ago

I had the same issue where the job wasn't written into the failed_jobs table after it failed. I finally found this error in the log:

[2016-10-05 14:06:14] local.ERROR: PDOException: SQLSTATE[42S22]: Column not found: 1054 Unknown column 'exception' in 'field list' in .../vendor/doctrine/dbal/lib/Doctrine/DBAL/Driver/PDOConnection.php:77

I had to delete the migration file and the entry in the migrations table for failed_jobs, then rerun the php artisan queue:failed-table command. This happened after I upgraded from 5.2 to 5.3. I didn't find anything in the documentation about this change and the need to update the structure of the failed_jobs table.
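
(For reference, the up() method of the migration generated by queue:failed-table in 5.3 looks roughly like the sketch below; column types may differ slightly between versions. The exception column is the one my old table was missing.)

Schema::create('failed_jobs', function (Blueprint $table) {
    $table->increments('id');
    $table->text('connection');
    $table->text('queue');
    $table->longText('payload');
    $table->longText('exception'); // new in 5.3
    $table->timestamp('failed_at')->useCurrent();
});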

Hope this helps at solving the issue.

Was the jobs table structure changed as well?

2016_08_08_115250_create_failed_jobs_table.php.txt 2016_10_05_145707_create_failed_jobs_table.php.txt

jasonmccreary commented 8 years ago

@medilek, these changes (and now a sample migration) are listed in the Upgrade Guide under the Database Driver Changes section.

timothyallan commented 8 years ago

@JesseLeite Exactly! I just had to roll back a production update to 5.3 this morning because our jobs seem to work 4 out of every 5 times, using the DB driver and latest 5.3. It's the same as you, they mysteriously vanish with no error or trace... Which obviously is the worst for anyone trying to fix this!

I even have Log::info(xxxx) throughout the calling function, and all the Log entries get logged without issue when the jobs disappear.

PHP 7 / Postgres
Laravel 5.3.18
artisan queue:work --queue=priority,basic,default --tries=3

taylorotwell commented 8 years ago

I seriously would never use the DB driver for queues in production. It is not intended to be a production solution.

timothyallan commented 8 years ago

Now, that's interesting. I never knew that the DB driver was not a production solution.

We're not processing millions of jobs, and it is nice to have a central repository for jobs without having to pay for Amazon. We can drop/spin up servers on a whim and all the jobs stay put in the db.

jasonmccreary commented 8 years ago

@timothyallan ensure you restart your queues after pushing the new code. I experienced this same problem and, in retrospect, I believe it was due to the old queue process still running.

timothyallan commented 8 years ago

@jasonmccreary Yup, this is all on Docker, so every time I make a code change, everything is restarted. I still remember getting stung a year ago for half a day by the old "queue processes still using old code" issue :)

jasonmccreary commented 8 years ago

@timothyallan roger that. Regardless, I still believe there is a ghost in the code that seems to be appearing in 5.3. I've managed to tweak my app's code/configuration enough to continue to use the database driver. However, it does sound like it is officially not recommended for production.

jesseleite commented 8 years ago

It is not intended to be a production solution.

To be fair, if failed jobs are unexpectedly lost to the ether, it's not stable enough for any environment :P

taylorotwell commented 8 years ago

Give me a fresh application that recreates that problem.