lantins / resque-retry

A resque plugin; provides retry, delay and exponential backoff support for resque jobs.
MIT License

How to diagnose jobs seemingly not retrying #97

Closed davetron5000 closed 10 years ago

davetron5000 commented 10 years ago

We are on Heroku and, due to daily (or more) dyno restarts, a lot of our jobs fail with SIGTERM. Almost all of these jobs can be retried and we've set up resque-retry.

We still noticed failures, so for a particular job, we set the job up like so:

class FireItemEventJob
  extend Resque::Plugins::Retry

  @queue = :fire_item_event
  @retry_limit = 20
  @retry_delay = 60 # seconds

  def self.perform(item_id, event_name, current_user_id = nil, new_hizzy_id = nil)
    # ...
  end
end

We still see consistent failures from this job. I just can't believe that it failed twenty times in a row over 20 minutes, each time with a SIGTERM.

This leads me to suspect that the job is not being retried at all.

How can I confirm this? Or, how can I examine what resque-retry is doing by looking at the logs? I don't see any logging in the source code; is there some I'm missing and should expect to be there?

Sorry for using issues for support—please let me know if there's a better place to ask this.

lantins commented 10 years ago

Hi @davetron5000

Are you using a failure backend? What version of Ruby, Resque and resque-retry are you using?

Logging can be added by using a failure backend (a standard part of Resque).

We can bat this around on GitHub issues or if you use IRC/some-other-IM I'd be happy to chat/help where possible.

Luke

lantins commented 10 years ago

You can find me on Freenode and Quakenet under the user lantins.

davetron5000 commented 10 years ago

Versions:

resque (1.25.2)
resque-retry (1.1.4)
resque-scheduler (2.5.5)
resque-status (0.4.2)
resque_mailer (2.2.6)

Ruby 2.1, and our config looks like so:

require 'resque/failure/redis'
Resque::Failure::MultipleWithRetrySuppression.classes = [Resque::Failure::Redis]
Resque::Failure.backend = Resque::Failure::MultipleWithRetrySuppression

Resque::Plugins::Status::Hash.expire_in = 24.hours.to_i

Resque.logger.level = Logger::INFO

I won't have good access to a chat-like thing today, but maybe Monday. Thanks for anything you can tell me!

jzaleski commented 10 years ago

@lantins i'm around as well if you need help with anything


lantins commented 10 years ago

@jzaleski Hey :) Any assistance from your good self is very welcome!

@davetron5000 Couple of things:

1) Retries should be working, but that doesn't mean they are!
2) Your code snippets look fine to me.
3) I've never used the Status plugin alongside Retry; this may be worth a closer look, or taking it out of the loop to see what happens.
4) Okay, you are using our failure backend; log in to the web interface and watch the failing jobs. Do they fail after 20 minutes, or do they instantly use up the 20 retry attempts?
5) If possible, run the unit tests on your Heroku stack and see if anything is flagged up.
6) Are you using resque-retry with any other jobs? Do they have similar problems?

davetron5000 commented 10 years ago

Most jobs don't use status, and we do see these same failures on apps w/out that plugin. Pretty much all our jobs use resque-retry.

resque-web is pretty much useless for us because we have thousands of scheduled jobs at any given time (over 50 pages), and it's hard to predict when a job will fail, much less find it in the web UI. 99.9% never fail, and of those that do, almost all fail because of SIGTERM and can be retried (which is why I've been bumping the number of retries up so high). That's why I like logging: I can search what happened after the fact.

Given console access, is there an easy way to find jobs that are in the schedule for retry reasons? I can manually check at different times, or even run a script to check and report back.

Alternatively, is there a way to get at the number of times a job has retried, e.g. in the job code? I could use that to add logging as well.

lantins commented 10 years ago

Re: console poking about

All data is stored in Redis. If you wish to poke about, look at these two methods to figure out which keys to inspect:

redis_retry_key
failure_key
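
For example, something along these lines in a console should show whether a retry is currently being tracked (using your FireItemEventJob and one of the argument lists from your logs; the exact key format comes from redis_retry_key):

# Rough console sketch, not plugin code.
args = [1059177, "PI scan", 486, nil]
key  = FireItemEventJob.redis_retry_key(*args)
Resque.redis.get(key)   # nil if no retry is being tracked, otherwise the attempt count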

Re: access to retry count in code

All resque-retry internals can be accessed from your job code; you can get the number of times a job has retried with the retry_attempt method.

Depending on the number of jobs, I'd be tempted to simply have the job log some timestamp/retry information to a file that you can later grep/paw over. If you have a larger number of jobs, perhaps only start doing this after the 15th retry?
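
A rough sketch of that idea using your FireItemEventJob (this isn't anything the plugin does for you, just example logging inside the job itself):

class FireItemEventJob
  extend Resque::Plugins::Retry

  @queue       = :fire_item_event
  @retry_limit = 20
  @retry_delay = 60

  def self.perform(item_id, event_name, current_user_id = nil, new_hizzy_id = nil)
    # retry_attempt (from resque-retry) is 0 on the first run and increments on
    # each retried run; only log once we're deep into the retry sequence.
    if retry_attempt.to_i >= 15
      Resque.logger.info("#{name}(#{item_id}, #{event_name}): attempt #{retry_attempt} of #{retry_limit}")
    end
    # ... existing job body ...
  end
end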

Other

If you've not already read it, this blog post may prove useful: http://hone.heroku.com/resque/2012/08/21/resque-signals.html

davetron5000 commented 10 years ago

Thanks, good info. I will poke around and see what I can see. I tried some of this out locally and it does seem to be working, so I can at least add some logging. My fear is that the jobs are failing outside the resque hook/perform cycle, meaning resque-retry isn't getting a hold of these failures.

lantins commented 10 years ago

I have a gut feeling the blog post will help if you were not aware of the signal-handling change; there are a number of Resque issues related to Heroku/SIGTERM, and this change addresses them.

Please let us know what you find.

jzaleski commented 10 years ago

Thanks @lantins, that is exactly the resource I was going to point him at.

@davetron5000, two quick questions: when you start a worker, what options are you specifying in the environment? Also, how long do your jobs take to finish on average?


davetron5000 commented 10 years ago

Our invocations generally look like this:

env NEW_RELIC_DISPATCHER=resque TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 bundle exec rake environment resque:work QUEUE=* --trace

We've been setting TERM_CHILD for the reasons outlined in that article you linked. I haven't resorted to turning on VVERBOSE yet, but I could certainly try it for a few days at least…
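
If I do, it should just be a matter of adding the variable to the same invocation, something like:

env NEW_RELIC_DISPATCHER=resque TERM_CHILD=1 RESQUE_TERM_TIMEOUT=10 VVERBOSE=1 bundle exec rake environment resque:work QUEUE=* --trace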

lantins commented 10 years ago

@jzaleski Hoping you have some more ideas! :)

davetron5000 commented 10 years ago

I'm gonna run our code today with VVERBOSE, and I also added some logging about the number of retries each time the job runs. This particular job fires a lot during the (US East Coast) afternoon, so we're bound to see a failure. With that I can figure out how many times it retries (and when), or whether the job simply died out-of-band.

Thanks again for the help. Will report back in a few hours…

jzaleski commented 10 years ago

Sounds good @davetron5000, the more information the better. I was planning to do some work on this today so please let us know your findings.


davetron5000 commented 10 years ago

OK, so it happened again this morning (finally :). Here's the log excerpt:

Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 2: got: (Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil]) 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 1005: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil])] 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 1005: (Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil]) failed: #<Resque::TermException: SIGTERM> 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1059177, "PI scan", 486, nil])] 
Apr 22 09:31:14 spectre-production app/fire_item_event_worker.3:  ** [16:31:14 2014-04-22] 2: Exiting... 

Observations:

It seems like it's either dying in the after_fork hook, or it got an exception before calling perform and is dying in the failure hooks.

Does this point us somewhere more specific?

davetron5000 commented 10 years ago

OK, more info. Here's the complete log message from a different failed job, noise stripped:

10:25:56 :  ** [17:25:55 2014-04-22] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil])] 
10:25:56 :  ** [17:25:55 2014-04-22] 2: got: (Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil]) 
10:25:56 :  ** [17:25:56 2014-04-22] 1693: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil])] 
10:25:56 :  remote_ip[n/a] uuid[n/a] pid[1693] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] Class(1101343,PI scan): attempt 0 of 20 
10:25:56 :  BACKGROUND AFTER Transmitting Message on sf.merch.style.sku.item.1101343.updated -> {:item_id=>1101343} 
10:25:56 :  BACKGROUND START Transmitting Message on sf.merch.style.sku.item.1101343.updated -> {:item_id=>1101343} 
10:25:57 :  ** [17:25:57 2014-04-22] 2: Exiting... 
10:25:57 :  BACKGROUND END Transmitting Message on sf.merch.style.sku.item.1101343.updated -> {:item_id=>1101343} 
10:25:58 :  ** [17:25:57 2014-04-22] 1693: (Job{fire_item_event} | FireItemEventJob | [1101343, "PI scan", 620, nil]) failed: #<Resque::TermException: SIGTERM> 

All that BACKGROUND stuff is from the job, and it's coming from code running in another thread spawned by sucker punch. Basically, it's sending a message on a message bus, and using sucker punch so that doing so doesn't tie up the main thread. This makes sense if the code is run in a Rails request cycle with a persistent worker such as Unicorn.

I'm guessing that since Resque workers are short-lived, this is causing a problem, which I will now explain in a timeline so that someone else can agree with me:

| Ticks | Master Resque Worker | Child Worker | Sucker Punch Thread |
|-------|----------------------|--------------|---------------------|
| 0 | Waiting for work | | |
| 1 | Sees work, forks | alive! | |
| 2 | Waiting for work | working! | |
| 3 | | spawns thread | alive! |
| 4 | | all done | still alive! |
| 5 | | dead? | still alive? |

Seems like launching threads from a resque job that are intended to outlive the forked worker process is likely a bad idea? I will attempt to reconfigure this stuff without spawning threads and see if that changes things.

lantins commented 10 years ago

Thanks for the information; it seems you're figuring it out quite fine by yourself :)

Yes, the threads associated with the worker cannot outlive the parent process.

If the processing (sending the message) happens in a different thread/process from the worker, it seems sensible that the worker won't know anything about any failures, so the plugin would be useless.

Unless the process fails before your thread is spawned, it will never trigger a retry.

I'd suggest sending the message within the same process/thread as the main worker.
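
Something like this, roughly (MessageBusPublisher is just a made-up stand-in for whatever actually transmits the message; I don't know your publishing API):

def self.perform(item_id, event_name, current_user_id = nil, new_hizzy_id = nil)
  # Publish inline rather than via a sucker punch job; a failure here raises in
  # the forked child process, so resque-retry's failure hooks can actually see it.
  MessageBusPublisher.publish("sf.merch.style.sku.item.#{item_id}.updated", item_id: item_id)
end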

If your timeline is correct, the around_perform hook used by resque-retry should still be triggered, so getting a SIGTERM before the child worker's perform completes should still trigger the retry logic. But perhaps the around_perform and perform have completed okay, as far as the worker knows anyhow.

davetron5000 commented 10 years ago

OK, more info now. sucker punch, celluloid, and new threads removed. What I'm seeing now is the job getting an exception in what I think is the reconnect call in Resque::Worker (code excerpt here), and it seems like the job is being both retried and sent to the failed queue.

# Resque::Worker#work excerpt
else # this block is the child process
  unregister_signal_handlers if will_fork? && term_child
  begin

    reconnect
    perform(job, &block)  # <=== logs indicate this is not being called

  rescue Exception => exception
    report_failed_job(job,exception)
  end

  # ...

Here's what I see in my logs.

First, we see the job get picked up:

Apr 22 17:54:09 app/fire_item_event_worker.3:  ** [00:54:09 2014-04-23] 2: got: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) 

The worker is fire_item_event_worker.3 with a PID of 2.

Next, we see that same worker and PID run the before_fork hooks (of which we don't have any configured anywhere, as far as I can tell):

Apr 22 17:54:09 app/fire_item_event_worker.3:  ** [00:54:09 2014-04-23] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"])] 

The first line of my perform method prints a log message, which we don't see. Instead, we see child PID 632 of our parent worker (fire_item_event_worker.3) die immediately with a SIGTERM:

Apr 22 17:54:10 app/fire_item_event_worker.3:  ** [00:54:10 2014-04-23] 632: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) failed: #<Resque::TermException: SIGTERM> 

The timestamp of this failure is the same as what we see in resque-web:

[resque-web screenshot of the failed job showing the same timestamp]

I believe this is raised from one of the signal handlers, due to the string "SIGTERM" being the only message:

# Resque::Worker
def unregister_signal_handlers
  trap('TERM') do
    trap('TERM') do
      # ignore subsequent terms
    end
    raise TermException.new("SIGTERM") # <===========
  end
  trap('INT', 'DEFAULT')

  begin
    trap('QUIT', 'DEFAULT')
    trap('USR1', 'DEFAULT')
    trap('USR2', 'DEFAULT')
  rescue ArgumentError
  end
end

Around 25 seconds later, we see this (note that the job is configured to retry after 60 seconds):

Job is picked up, this time by a different worker, fire_item_event_worker.1:

Apr 22 17:55:35 app/fire_item_event_worker.1:  ** [00:55:35 2014-04-23] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"])] 
Apr 22 17:55:35 app/fire_item_event_worker.1:  ** [00:55:35 2014-04-23] 2: got: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) 

Job is now fully executed:

Apr 22 17:55:35 app/fire_item_event_worker.1:  ** [00:55:35 2014-04-23] 26452: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"])]  

Here is where the job's perform method takes over (disregard the BACKGROUND messages; those messages are not being run in a thread):

Apr 22 17:55:35 app/fire_item_event_worker.1:  Class(918302,assumed rebinned): attempt 0 of 20 
Apr 22 17:55:36 app/fire_item_event_worker.1:  BACKGROUND START Transmitting Message on sf.merch.style.sku.item.918302.updated -> {:item_id=>918302} 
Apr 22 17:55:37 app/fire_item_event_worker.1:  BACKGROUND END Transmitting Message on sf.merch.style.sku.item.918302.updated -> {:item_id=>918302} 
Apr 22 17:55:37 app/fire_item_event_worker.1:  BACKGROUND AFTER Transmitting Message on sf.merch.style.sku.item.918302.updated -> {:item_id=>918302} 
Apr 22 17:55:37 app/fire_item_event_worker.1:  ** [00:55:37 2014-04-23] 26452: done: (Job{fire_item_event} | FireItemEventJob | [918302, "assumed rebinned"]) 

I have verified in our logs that the code that queues this job occurred only once, i.e. there were not two jobs with the same args queued.

So, there are two bits of odd behavior:

Job being retried sooner

My only explanation is that somehow, resque-retry is getting a retry_delay of 0 in this code:

      def try_again(exception, *args)
        temp_retry_delay = ([-1, 1].include?(method(:retry_delay).arity) ? retry_delay(exception.class) : retry_delay)

        retry_in_queue = retry_job_delegate ? retry_job_delegate : self
        if temp_retry_delay <= 0 # <==============
          Resque.enqueue(retry_in_queue, *args_for_retry(*args))
        else
          Resque.enqueue_in(temp_retry_delay, retry_in_queue, *args_for_retry(*args))
        end
        # ...
      end

In console the class returns a method with arity -1, so I don't see how this could happen, but since there's no logging about what's going on, it's hard to be sure. It certainly seems possible that this code could run before @retry_exceptions is set, which would set off a codepath returning 0 for retry_delay.
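
For reference, the console check amounts to something like this (assuming the job class is loaded):

FireItemEventJob.method(:retry_delay).arity   # => -1, i.e. retry_delay takes an optional argument
FireItemEventJob.retry_delay                  # => 60 here, per @retry_delay in the job class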

Job being put into the failed queue AND retried

The only way I can explain this would be this code in the failure backend:

      def save
        if !(retryable? && retrying?)
          cleanup_retry_failure_log!
          super
        elsif retry_delay > 0

If, for some reason, retrying? were to return false (meaning Resque.redis.exists(retry_key) reports that the key is not there), then resque-retry's failure backend would delegate to the default one and the job would go to the failed queue.

And this could happen if the before_perform_retry hook wasn't called, which it most likely wasn't, because my theory is that we died before Resque::Worker.perform is called.

Since there's no logging, there's no way to know for sure, but this long-winded explanation fits the facts.

Fixing it

What seemed odd about my journey through the code was that we schedule the job for retry in one place (on_failure hook), but we record the fact that that job is being retried in another (before_perform hook).

Is there a reason for this? Or should we move this code from before_perform_retry into on_failure_retry?

retry_key = redis_retry_key(*args)
Resque.redis.setnx(retry_key, -1)             # default to -1 if not set.

jzaleski commented 10 years ago

Awesome stuff @davetron5000!

It seems like you may be on the right track. I will give your "Fixing it" suggestion some consideration. In the meantime, if you'd like, add some logging to the gem and submit a pull-request (I am sure the community would appreciate it -- I will happily merge it).

davetron5000 commented 10 years ago

PR submitted. I'm going to try to run my branch of this in production and see what it tells us.

lantins commented 10 years ago

@davetron5000 Have you tried your suggested fix? If so anything to report?

If all the tests pass with the suggested fix, I see no reason why we can't merge, but I would like to look at this a little closer (perhaps once we've got some data from your extra logging?).

davetron5000 commented 10 years ago

Deployed #99 to production yesterday. Unfortunately I didn't have logging inside the failure backend, so I'm still getting an incomplete picture. Also, the log messages are out of order, but I'm going to blame Papertrail for that; I've rearranged them in the order they were logged.

Job gets picked up:

Apr 24 14:26:48 worker.3:  ** [21:26:47 2014-04-24] 2: got: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) 
Apr 24 14:26:48 worker.3:  ** [21:26:47 2014-04-24] 1211: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])]  
Apr 24 14:26:48 worker.3:  ** [21:26:47 2014-04-24] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])] 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) failed: #<Resque::TermException: SIGTERM> 

We see the new logging from resque-retry:

Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: on_failure_retry 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: Retrying based on exception 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: NOT Retrying based on criteria 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: trying again 
Apr 24 14:26:48 worker.3:  ** [21:26:48 2014-04-24] 1211: [828852, "PI scan", 329, nil] [Resque::TermException/SIGTERM]: retry delay: 60 for queue FireItemEventJob  

This is where the story ends, unfortunately. It at least confirms that the logic inside the plugin itself is acting as expected, which blows my theory that the retry delay was becoming 0 through some race condition.

I'm going to deploy my updated fork that has logging in the failure backend and we can see where this leaves off. It also seems that resque-scheduler does some logging, which I will enable; it could be the culprit here, too.

As before, the job is run again a few seconds later:

Apr 24 14:27:55 worker.1:  ** [21:27:55 2014-04-24] 2: got: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) 
Apr 24 14:27:55 worker.1:  ** [21:27:55 2014-04-24] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])] 
Apr 24 14:27:56 worker.1:  ** [21:27:55 2014-04-24] 31442: Running after_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil])] 
Apr 24 14:27:56 worker.1:  ** [21:27:56 2014-04-24] 31442: [828852, "PI scan", 329, nil]: attempt 0 set in Redis 
Apr 24 14:27:56 worker.1:  ** [21:27:56 2014-04-24] 31442: [828852, "PI scan", 329, nil]: before_perform_retry 
Apr 24 14:27:56 worker.1:  remote_ip[n/a] uuid[n/a] pid[31442] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] Class(828852,PI scan): attempt 0 of 20 
Apr 24 14:27:57 worker.1:  remote_ip[n/a] uuid[n/a] pid[31442] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] BACKGROUND START Transmitting Message on sf.merch.style.sku.item.828852.updated -> {:item_id=>828852} 
Apr 24 14:27:57 worker.1:  remote_ip[n/a] uuid[n/a] pid[31442] auth_email[n/a] auth_id[n/a] client_email[n/a] client_id[n/a] user_agent[n/a] heroku_request_id[n/a] BACKGROUND AFTER Transmitting Message on sf.merch.style.sku.item.828852.updated -> {:item_id=>828852} 
Apr 24 14:27:57 worker.1:  ** [21:27:57 2014-04-24] 31442: [828852, "PI scan", 329, nil]: after_perform_retry, clearing retry key 
Apr 24 14:27:57 worker.1:  ** [21:27:57 2014-04-24] 31442: [828852, "PI scan", 329, nil]: Clearing retry key 
Apr 24 14:27:57 worker.1:  ** [21:27:57 2014-04-24] 31442: done: (Job{fire_item_event} | FireItemEventJob | [828852, "PI scan", 329, nil]) 

lantins commented 10 years ago

Okay, keep us updated, as I'm sure you will!

davetron5000 commented 10 years ago

tl;dr

The plugin assumes that every job's perform method is called in order for it to work. In my case, that's not happening, so the job is being re-queued and put into the failed queue.

I don't know what the solution is, but having try_again check for redis_retry_key would at least prevent the job from being effectively duplicated.

The details

OK, I believe my theory is confirmed.

Here's the log from the plugin:

Apr 25 15:26:13 worker.8:  ** [22:26:12 2014-04-25] 2: Running before_fork hooks with [(Job{fire_item_event} | FireItemEventJob | [723809, "assumed rebinned"])] 
Apr 25 15:26:13 worker.8:  ** [22:26:12 2014-04-25] 2: got: (Job{fire_item_event} | FireItemEventJob | [723809, "assumed rebinned"]) 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: (Job{fire_item_event} | FireItemEventJob | [723809, "assumed rebinned"]) failed: #<Resque::TermException: SIGTERM> 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: on_failure_retry 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: Exception is sufficient for a retry 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: user retry criteria is not sufficient for a retry 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: try_again 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: retry delay: 60 for class: FireItemEventJob 

And here's the log from the failure back-end:

Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: failure backend save 
Apr 25 15:26:13 worker.8:  ** [22:26:13 2014-04-25] 602: [723809, "assumed rebinned"] [Resque::TermException/SIGTERM]: !(true && false) - sending failure to superclass 

The second line is from this code:

def save
  log 'failure backend save', payload, exception

  if !(retryable? && retrying?)
    log "!(#{retryable?} && #{retryable? && retrying?}) - sending failure to superclass", payload, exception
    cleanup_retry_failure_log!
    super

This means that retrying? is false. Its source:

def retrying?
  Resque.redis.exists(retry_key)
end

And retry_key:

def retry_key
  klass.redis_retry_key(*payload['args'])
end

klass is the job itself, so redis_retry_key is a method provided by the plugin.

So, what this tells me is that when the failure back-end is figuring out what to do, the plugin has not yet placed the job's retry information into Redis, so the failure backend figures the job isn't being retried and defers up the chain (which, in my configuration, places the job on the failed queue).

Since perform on my job isn't being called, nor do we see Resque's after_fork hook log message, we can assume the error is happening during reconnect here (from https://github.com/resque/resque/blob/1-x-stable/lib/resque/worker.rb#L189):

begin

  reconnect
  perform(job, &block)

rescue Exception => exception
  report_failed_job(job,exception)
end

In that case, report_failed_job gets called before any of the perform hooks are called. Sure enough, the redis key is set in a before-perform hook (see https://github.com/stitchfix/resque-retry/blob/72ac3568d4bade04c77222183b6e2d2d71dbd8a2/lib/resque/plugins/retry.rb#L337 ):

def before_perform_retry(*args)
  log 'before_perform_retry', args
  @on_failure_retry_hook_already_called = false

  # store number of retry attempts.
  retry_key = redis_retry_key(*args)
  Resque.redis.setnx(retry_key, -1)             # <=======================
  @retry_attempt = Resque.redis.incr(retry_key)
  log "attempt: #{@retry_attempt} set in Redis", args
  @retry_attempt
end

So, back to report_failed_job, it calls Job#fail, which is:

def fail(exception)
  run_failure_hooks(exception)
  Failure.create \
    :payload   => payload,
    :exception => exception,
    :worker    => worker,
    :queue     => queue
end

run_failure_hooks is what calls on_failure_retry in the plugin.

So, it seems that the intended happy path for the plugin would be as follows:

| Tick | Method | Activity |
|------|--------|----------|
| 1 | Worker.work | |
| 2 | fork | |
| 3 | reconnect | |
| 4 | Worker.perform | |
| 5 | Job.perform | |
| 6 | perform hooks | |
| 7 | before_perform_retry | Sets redis_retry_key |
| 8 | MyJob.perform | does its thing |
| 9 | | Exception raised!! |
| 10 | Worker.work | rescue that exception |
| 11 | Worker.report_failed_job | |
| 12 | Job.fail | |
| 13 | run_failure_hooks | |
| 14 | on_failure_retry | |
| 15 | try_again | queues the job for later |
| 16 | Job.fail | Failure.create |
| 17 | Failure.create | MultipleWithRetrySuppression.save |
| 18 | MultipleWithRetrySuppression.save | sees that redis_retry_key is there, so don't bubble up the job to the failed queue |

What's happening in my case is this:

| Tick | Method | Activity |
|------|--------|----------|
| 1 | Worker.work | |
| 2 | fork | |
| 3 | reconnect | |
| 4 | | Exception raised! |
| 5 | Worker.work | rescue that exception |
| 6 | Worker.report_failed_job | |
| 7 | Job.fail | |
| 8 | run_failure_hooks | |
| 9 | on_failure_retry | |
| 10 | try_again | queues the job for later |
| 11 | Job.fail | Failure.create |
| 12 | Failure.create | MultipleWithRetrySuppression.save |
| 13 | MultipleWithRetrySuppression.save | doesn't find redis_retry_key, so the job bubbles up to the failed queue despite having been queued for retry at Tick 10 |

Whew! All that to say that the retry logic assumes that we get as far as perform, or it doesn't work.

How to solve this is less clear to me. We can't simply move the setting of redis_retry_key into on_failure_retry, because it's clear that redis_retry_key needs to be set for every job.
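
One possible shape for a fix (just a sketch of the idea, not actual plugin code) would be to set the key defensively in the retry path too, so the failure backend can tell the job was scheduled for retry even when perform never ran:

# Sketch only: mirror the setnx from before_perform_retry at the start of the
# retry path, so redis_retry_key exists by the time the failure backend's
# save method asks retrying?
def try_again(exception, *args)
  retry_key = redis_retry_key(*args)
  Resque.redis.setnx(retry_key, -1)   # normally only done in before_perform_retry

  # ... existing delay calculation and enqueue / enqueue_in logic ...
end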

jzaleski commented 10 years ago

@davetron5000 are you working on a fix for this? Should I expect a pull-request? Just trying to ensure that this does not fall off the grid.

davetron5000 commented 10 years ago

I am going to submit a pull request for this; I've just been sidetracked. I think you are good to wait until you hear from me.

jzaleski commented 10 years ago

Alrighty. Thanks for the update.


davetron5000 commented 10 years ago

My proposed fix is in #100.