sidekiq / sidekiq

Simple, efficient background processing for Ruby
https://sidekiq.org

Negative pending job counts in batches #3019

Closed jakemack closed 8 years ago

jakemack commented 8 years ago

We run into negative pending job counts every night and this prevents our batches from finishing properly. I was playing around with the batch middleware today and have found at least two cases (though they're essentially the same idea) where I can force this negative value to happen.

Given a batch with one job: after the job completes successfully and the middleware runs Sidekiq::Batch::Server#add_success, an exception raised within Sidekiq::Batch::Server#needs_complete? or Sidekiq::Batch::Server#enqueue_callback (at lines 60 and 62 for batch completion; I didn't test the calls for batch success) causes the pending count to decrement to 0, the failed count to increment to 1, no callbacks to be queued, and the job to be put in the retry set. When the job then retries successfully, it decrements the pending count to -1, decrements the failed count to 0, and again queues no callbacks. I'm manually simulating these exceptions by swapping the batch middleware out for a custom one that is identical except that it raises a generic exception at whatever point I want, but in practice I'm assuming we occasionally hit timeouts on these second/third Redis calls, which forces us into this state.

Hopefully I described it well enough for you to replicate (it's late and I can't be certain I'm making sense, but I did the tests earlier while I was quite sane). I have a feeling the solution to this is to write a Lua script to handle all of the Redis calls within Sidekiq::Batch::Server#add_success and Sidekiq::Batch::Server#add_failure. The main issue with that is replicating the Sidekiq::Client.push call in Lua. It doesn't seem like adding all of this will significantly impact performance because we're already serializing 11 Redis calls with multi. Then again, perhaps we can come up with a non-Lua way to handle this. Anyway, let me know what you think.
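A rough sketch of what I have in mind is below. The key names (`b-<bid>-pending`, `b-<bid>-success`) are placeholders, not Sidekiq Pro's actual schema, and it only covers the success path, but it shows the shape of the idea: the decrement, the completion check, and the callback push happen in one atomic EVAL.

```ruby
require 'redis'

# Placeholder key names and a pre-serialized callback payload; this is not
# Sidekiq Pro's real implementation, just the shape of the atomic approach.
BATCH_SUCCESS_SCRIPT = <<-LUA
  local pending = redis.call('decr', KEYS[1])        -- batch pending counter
  redis.call('incr', KEYS[2])                        -- batch success counter
  if pending == 0 then
    -- the bare minimum of Sidekiq::Client.push: register the queue and
    -- push the already-serialized callback job payload
    redis.call('sadd', 'queues', ARGV[2])
    redis.call('lpush', 'queue:' .. ARGV[2], ARGV[1])
  end
  return pending
LUA

def add_success_atomically(redis, bid, callback_json, callback_queue = 'default')
  redis.eval(BATCH_SUCCESS_SCRIPT,
             keys: ["b-#{bid}-pending", "b-#{bid}-success"],
             argv: [callback_json, callback_queue])
end
```

Of course this skips the client middleware that Sidekiq::Client.push would normally run, which is part of why it isn't a drop-in replacement.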

mperham commented 8 years ago

What is the size and scope of your batch usage? How many jobs are you running through it each night, and how many processes and threads are you running? Which version of Sidekiq/Pro/Ent are you using? This is the broader issue.

The Pending count can go negative for a few reasons:

  1. Duplicate jobs being created, leading to the same JID being executed twice somehow.
  2. Multiple processes sharing the same reliable_fetch index/hostname, leading to (1).
  3. Network issues causing a precisely timed exception, so the job is counted as successful but not acknowledged, leading to (1).

If this is happening every night, I'd double check that you don't have a config problem leading to (2).
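A quick way to sanity-check (2) is to list every live process and look for duplicate hostnames. This is a sketch using the public Sidekiq::ProcessSet API, with field names as Sidekiq 4.x reports them:

```ruby
require 'sidekiq/api'

# Count live Sidekiq processes per reported hostname; any hostname seen more
# than once is a candidate for two processes sharing a reliable_fetch identity.
counts = Hash.new(0)
Sidekiq::ProcessSet.new.each { |process| counts[process['hostname']] += 1 }

counts.each do |host, n|
  puts "#{host}: #{n} process(es)#{' <-- duplicate' if n > 1}"
end
```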

jakemack commented 8 years ago

We run ~7 million jobs per night in this specific set of batches, each batch being ~10,000 jobs. Some jobs will requeue themselves (with the same parameter signature) based on certain results in the job, so new jobs will be added to each batch during processing, but I don't believe that should cause any other race conditions, since the job creating the new one is already in the same batch and can't complete until the new one is inserted. We run 100 processes (dynos on Heroku) and 5000 threads. Being on Heroku, the indexes/hostnames should be unique.
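For reference, the requeue pattern looks roughly like this (worker name and arguments are made up; the reopen-by-bid call is the documented Sidekiq Pro batch pattern, assuming sidekiq-pro is loaded):

```ruby
require 'sidekiq'

class NightlyWorker
  include Sidekiq::Worker

  def perform(record_id)
    result = process(record_id)

    # Requeue with the same signature when more work is needed. The new job is
    # pushed into this job's own batch, so the batch can't complete before the
    # jobs block finishes.
    if result == :retry_later
      batch = Sidekiq::Batch.new(bid)   # bid is available to jobs running inside a batch
      batch.jobs { NightlyWorker.perform_async(record_id) }
    end
  end

  private

  # Stand-in for the real per-record work.
  def process(record_id)
    :done
  end
end
```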

sidekiq (4.1.2) sidekiq-pro (3.2.1)

I'm (almost) certain that our issue is from (3). At lower volumes of workers, we never had these issues, but as we scaled up workers we started to see negative pending counts.

I do like the idea behind the Job Transaction API as it seems like the full solution to the root cause of the problem, but I assume that is still some time off. If you don't have any suggestions for the near-term, perhaps I'll attempt to write a modified batch middleware that will at least handle this problem within its own scope.

jakemack commented 8 years ago

That said, I'll double-check our config.

mperham commented 8 years ago

If it's happening every night reliably, sounds like you've outgrown your Redis plan and need to move to a larger plan or run your own Redis in-house. Exceptions are pain signals that need attention. At the very least, log and verify you are seeing Redis network issues as the cause of this problem.

Heroku does not work well with reliable fetch because its hostnames are not stable. Sidekiq Pro has an internal workaround for this since it's a common case, but you should move to timed_fetch.
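Switching strategies should just be a one-line change in the server config. This sketch assumes the Pro 3.x style of enabling it; check the Sidekiq Pro wiki for the exact call for your version:

```ruby
# config/initializers/sidekiq.rb
Sidekiq.configure_server do |config|
  config.timed_fetch!   # assumed Pro 3.x API; replaces config.reliable_fetch!
end
```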

mperham commented 8 years ago

I haven't moved forward with the transaction API yet because there hasn't been much customer feedback that it's been an issue in practice, and it requires a significant change to a lot of functionality. Do I change internals, destabilizing the code to solve a theoretical problem, or let sleeping dogs lie?

jakemack commented 8 years ago

Yeah, I've been meaning to try out timed_fetch, I just haven't had the bandwidth to test it yet.

If there hasn't been much demand for the transaction API, I understand your hesitation to put in that effort. We don't use anything but the stock middleware, so we don't really have much need for it aside from this batch issue. But this batch issue can be resolved (at least internally to itself) by making the calls within the batch middleware atomic. I will most likely be looking at that in the coming days, and if I make any progress on it, perhaps we can get it pulled into the sidekiq-pro gem. It would at least reduce the surface area of the problem.

And that's a good point, I should verify that the network issues are on the Redis side. I just did some poking around and our cloud Redis provider suggests dedicated resources for 50k+ ops per second and we're not there yet, so there's a chance the network issues are on our end. If not, then yes, we might need to upgrade our plan.

Thanks for the responses. I at least now have a plan of attack for moving forward.

mperham commented 8 years ago

Follow #2245 for more detail. Open another issue if you find anything specific we can help with.

mperham commented 8 years ago

Sorry, #3022