lantins / resque-retry

A resque plugin; provides retry, delay and exponential backoff support for resque jobs.

ExponentialBackoff working randomly #42

Closed pimeys closed 10 years ago

pimeys commented 12 years ago

I can't really be sure which project this bug relates to, but I'll start here.

We have a class that uses resque-retry with the ExponentialBackoff plugin and the MultipleWithRetrySuppression failure backend in the following way:

require 'resque-retry'
require 'resque/failure/redis'

class WorkerClass
  extend Resque::Plugins::ExponentialBackoff
  Resque::Failure::MultipleWithRetrySuppression.classes = [Resque::Failure::Redis]
  Resque::Failure.backend = Resque::Failure::MultipleWithRetrySuppression

  @queue = :foobar

  # Retry strategy    2m    8m   24m   ~1h   ~2h   ~6h    ~14h   ~1.5d   ~3d     ~7d
  @backoff_strategy = [120, 480, 1440, 3840, 9600, 23040, 53760, 122880, 276480, 614400]

  def self.perform
    #fast pinging here
  end
end
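
As I understand the plugin's documented behaviour (an assumption on my part, since the real lookup lives inside resque-retry), the delay for each attempt is read straight from this array, with attempts past the end reusing the last entry:

# Assumed lookup: attempt 0 sleeps backoff_strategy[0] seconds, attempt 1
# sleeps backoff_strategy[1], and any attempt beyond the array reuses the
# final (largest) delay.
def delay_for_attempt(backoff_strategy, retry_attempt)
  backoff_strategy[retry_attempt] || backoff_strategy.last
end

delay_for_attempt([120, 480, 1440], 0)   # => 120  (2 minutes)
delay_for_attempt([120, 480, 1440], 10)  # => 1440 (past the end, last entry reused)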

We have a modified Resque where we don't fork at all; we run the workers in the same process, using Ruby fibers for concurrency. For Redis we use the em-redis library, set when Resque.redis is initialized.

The code processes a huge number of jobs every day, and if a job fails we want to retry it. Sometimes the retrying works as it should (with the correct timespans), but randomly it either doesn't retry at all or stops retrying too early. There are also cases where we see 20-30 retries for a single job.

The worker's job is to ping 3rd-party servers and store the response. If the response is not successful, our ensure block raises an exception and stores the result, so in a perfect world the retry plugin would catch this and schedule a delayed retry.
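
Roughly, the perform body looks like this (a simplified sketch; ping and store_result are hypothetical helpers standing in for our real HTTP and persistence code):

def self.perform(url, subid)
  response = ping(url)               # hypothetical call to the 3rd-party server
ensure
  store_result(subid, response)      # hypothetical persistence of whatever we got back
  # raising from here is what should hand the job to resque-retry for a delayed retry
  raise "ping failed for subid #{subid}" if response.nil? || !response.success?
end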

I have a feeling it might also relate to the weird exception handling in the fibers and EventMachine...

lantins commented 12 years ago

Thanks for all the information, @pimeys.

Could you please clone the repo and run the tests, see if anything fails?

I'll be taking a look at another issue this weekend, I'll swap em-redis into the mix and see if I can reproduce your issue.

If you could provide some more information about your stack, that would be very useful: what versions of Ruby and Redis, etc.?

Thanks!

Luke

pimeys commented 12 years ago

I've read all the tests many times and run them all; they pass 100%. It's a pretty hard problem, because in the test environment, and even on staging, everything works just fine. But on production we have situations where one job doesn't retry at all, another retries only a couple of times (11 is the target), and some retry 200-300 times in a short timespan.

Our stack is as follows:

I even tried to build my own test environment, which just fails jobs, and I implemented a couple of functions from resque-retry to incr and poll the retry counter. Everything works, but in production there are still some misbehaving jobs. The percentage is below 1%, but when processing a huge amount of work every day, the number of bad ones is too big.
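
The poking-around helpers look roughly like this (the key layout is my assumption of how resque-retry namespaces its counters, not copied from the gem):

require 'redis'

# Hypothetical helpers for inspecting retry counters in production.
# The key format is assumed; check redis_retry_key in the gem source.
def retry_key(job_class, *args)
  "resque-retry:#{job_class}:#{args.join('-')}"
end

def incr_retry_counter(redis, job_class, *args)
  redis.incr(retry_key(job_class, *args))
end

def poll_retry_counter(redis, job_class, *args)
  redis.get(retry_key(job_class, *args)).to_i
end

redis = Redis.new
incr_retry_counter(redis, 'WorkerClass', 'subid-123')
poll_retry_counter(redis, 'WorkerClass', 'subid-123')  # => 1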

lantins commented 12 years ago

Thank you very much for the extra detail, it's useful and also very interesting for a geek like me ;-)

Your issue may be related to these:

I spent some time over the weekend doing a little digging / testing; not much to report (apart from the above). I should manage to tinker some more this week and will report back what I find.

pimeys commented 12 years ago

I already wrote a couple of rake tasks which go through the error and retry queues and clean them up with a set of rules. I need to check the issues you posted to see if this problem is related to a failure in the failure handling. :) It works most of the time, which is kind of weird and makes me think the problem is with EventMachine, the async Redis gem, and some kind of race condition.
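
The tasks are along these lines (a simplified sketch; the real rules are more involved, and the age cutoff here is hypothetical):

require 'resque'
require 'time'

namespace :failures do
  desc 'Prune old failed jobs (sketch, assumes the standard Redis failure backend)'
  task :cleanup do
    # Walk the failure list backwards so removals do not shift the indices
    # of entries we have not inspected yet.
    (Resque::Failure.count - 1).downto(0) do |index|
      failure = Resque::Failure.all(index, 1)
      next unless failure
      # hypothetical rule: drop anything that failed more than a week ago
      Resque::Failure.remove(index) if Time.parse(failure['failed_at']) < Time.now - 7 * 24 * 3600
    end
  end
end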

pimeys commented 12 years ago

It seems to work much better. We used to have only ~1-10% of the failed callbacks failing properly; the rest misbehaved. Now somewhere around 60-70% fail properly, but we still have a couple of problematic ones left. Those may be old enough that they can't tell us whether we still have this bug. In any case, it works much better now.

lantins commented 12 years ago

I've pushed a prerelease gem: v1.0.0.a

Could you please give this version a try and let me know if everything is all fixed?
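
Prerelease versions aren't picked up by default, so it needs to be pinned explicitly, e.g. with gem install resque-retry --pre or a Gemfile entry like:

# Bundler skips prerelease versions unless the version is pinned explicitly.
gem 'resque-retry', '1.0.0.a'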

pimeys commented 12 years ago

It's now at least ending the retries after 11 tries (which is our maximum). But something is still wrong. We have jobs in the Resque fail queue with only 3-5 retries, and some jobs where all 11 tries happen within a single day. Our whole backoff strategy is eleven tries, with the last retry seven days out.

One issue I can think of is that the retry job key is not always unique. We're using a subid as a method argument, and there's a small chance of firing two callbacks with the same subid. If they both fail, they will race the retry count pretty fast. Still, this is a very rare case and doesn't explain this many errors in our system.
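
One workaround I'm considering (assuming the plugin lets the retry identifier be overridden per job class, which I still need to confirm for the version we run) is to mix something unique into the identifier so two callbacks with the same subid stop sharing a counter:

class WorkerClass
  extend Resque::Plugins::ExponentialBackoff

  # Assumed hook: the plugin derives its retry key from this value. The token
  # argument is hypothetical and would have to be added at enqueue time.
  def self.retry_identifier(subid, token = nil)
    [subid, token].compact.join('-')
  end
end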

Example: a job retried nine times, with the same strategy as the code in my first message:

+---------------------+
| created_at          |
+---------------------+
| 2012-05-26 03:00:20 |
| 2012-05-26 03:02:30 |
| 2012-05-26 03:04:37 |
| 2012-05-26 03:06:39 |
| 2012-05-26 04:10:42 |
| 2012-05-26 04:34:47 |
| 2012-05-26 04:36:50 |
| 2012-05-26 04:38:56 |
| 2012-05-26 04:40:59 |
+---------------------+
9 rows in set (0.03 sec)

The next retry will be today, three days after the first one.

knaidu commented 10 years ago

Any update on this? I believe we are also seeing similar issues. Was this merged?

lantins commented 10 years ago

@knaidu Can you please try the latest gem release, v1.1.1?

I've not been able to reproduce this problem so far, would appreciate a test case / process that allows us to reproduce it.

I think this is going to be hard to track down without input from others; I'll leave it open for the moment in the hope that either @knaidu or @pimeys can update us.

@jzaleski any thoughts?

jzaleski commented 10 years ago

@lantins let's see how it works on the newer version of the gem. Hopefully @knaidu or one of the other commenters on this issue can chime in w/ more detail and steps-to-reproduce.

lantins commented 10 years ago

I'm closing this; if anyone objects, speak up.