Closed jakemack closed 8 years ago
What is the size and scope of your batch usage? How many jobs are you running through it each night, how many processes and threads are you running? Which version of Sidekiq/Pro/Ent are you using? This is the broader issue.
The Pending count can go negative for a few reasons:
If this is happening every night, I'd double-check that you don't have a config problem leading to (2).
We run ~7 million jobs per night in this specific set of batches, each batch being ~10,000 jobs. Some jobs requeue themselves (with the same parameter signature) based on certain results in the job, so new jobs are added to each batch during processing. I don't believe that should cause any other race conditions, since the job creating the new one is already in the same batch and can't complete until the new one is inserted. We run 100 processes (dynos on Heroku) and 5,000 threads. Being on Heroku, the indexes/hostnames should be unique.
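The claim above about self-requeueing jobs can be illustrated with a toy counter in plain Ruby (no Sidekiq involved; the ordering of the two counter updates is exactly the assumption being made):

```ruby
# Toy model of a batch's pending counter. The point: a job that requeues
# itself registers the replacement job (pending += 1) before its own
# success is recorded (pending -= 1), so pending stays above zero while
# work remains and the batch can't be considered complete prematurely.
pending = 1     # the batch starts with one job

# The running job decides it needs to requeue itself:
pending += 1    # the replacement job is inserted into the batch first...
pending -= 1    # ...then the current job is marked successful

pending # => 1, the batch is still open
```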
sidekiq (4.1.2), sidekiq-pro (3.2.1)
I'm (almost) certain that our issue is from (3). At lower volumes of workers, we never had these issues, but as we scaled up workers we started to see negative pending counts.
I do like the idea behind the Job Transaction API as it seems like the full solution to the root cause of the problem, but I assume that is still some time off. If you don't have any suggestions for the near-term, perhaps I'll attempt to write a modified batch middleware that will at least handle this problem within its own scope.
That said, I'll double-check our config.
If it's happening every night reliably, sounds like you've outgrown your Redis plan and need to move to a larger plan or run your own Redis in-house. Exceptions are pain signals that need attention. At the very least, log and verify you are seeing Redis network issues as the cause of this problem.
Heroku does not work well with reliable fetch because its hostnames are not stable. Sidekiq Pro has a workaround for this internally as a common case but you should move to timed_fetch.
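For reference, switching fetch strategies is a one-line server config change. This is a sketch assuming Sidekiq Pro's `timed_fetch!` configuration method and a conventional initializer path; check your Pro version's documentation for the exact call:

```ruby
# config/initializers/sidekiq.rb (hypothetical path)
Sidekiq.configure_server do |config|
  # timed_fetch does not depend on stable hostnames, unlike reliable
  # fetch, so it behaves better on Heroku's ephemeral dynos.
  config.timed_fetch!
end
```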
I haven't moved forward with the transaction API yet because there hasn't been much customer feedback that it's been an issue in practice and it requires a significant change to a lot of functionality. Do I change internals, destabilizing the code to solve a theoretical problem or let sleeping dogs lie?
Yeah, I've been meaning to try out timed_fetch, I just haven't had the bandwidth to test it yet.
If there hasn't been much demand for the transaction API, I understand your hesitation to put in that effort. We don't use anything but the stock middleware, so we also don't have much need for it aside from this batch issue. But this batch issue can be resolved (at least internally to itself) by making the calls within the batch middleware atomic. I'll most likely be looking at that in the coming days, and if I make any progress, perhaps we can get it pulled into the sidekiq-pro gem. It would at least reduce the surface area of the problem.
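As a rough illustration of what "making the calls atomic" could look like: the success-path bookkeeping folded into a single Lua script run via `EVAL`, so the decrement, the completion check, and the callback push either all happen or none do. The key layout here (a `b-<bid>` hash with a `pending` field, a callback queue list) is invented for illustration and is not Sidekiq Pro's actual schema:

```ruby
# Hypothetical Lua script: all success-path Redis writes in one atomic
# step. Field and key names are illustrative only.
ATOMIC_ADD_SUCCESS = <<~LUA
  local pending = redis.call('HINCRBY', KEYS[1], 'pending', -1)
  if pending == 0 then
    -- batch finished: push the (pre-serialized) callback job in the
    -- same atomic step, so a timeout can't separate the two writes
    redis.call('LPUSH', KEYS[2], ARGV[1])
  end
  return pending
LUA

# Usage sketch (not run here), assuming the redis-rb client's #eval:
#   Sidekiq.redis do |conn|
#     conn.eval(ATOMIC_ADD_SUCCESS,
#               keys: ["b-#{bid}", "queue:callbacks"],
#               argv: [callback_payload_json])
#   end
```

The awkward part, as noted above, is that the callback push would have to reimplement the `Sidekiq::Client.push` serialization inside the script's arguments.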
And that's a good point, I should verify that the network issues are on the Redis side. I just did some poking around: our cloud Redis provider suggests dedicated resources at 50k+ ops/sec, and we're not there yet, so there's a chance the network issues are on our end. If not, then yes, we might need to upgrade our plan.
Thanks for the responses; I at least now have a plan of attack for moving forward.
Follow #2245 for more detail. Open another issue if you find anything specific we can help with.
Sorry, #3022
We run into negative pending job counts every night, and this prevents our batches from finishing properly. I was playing around with the batch middleware today and have found at least two cases (though they're essentially the same idea) where I can force this negative value to happen.
Given a batch with one job: after successful completion of the job, as the middleware tries to run `Sidekiq::Batch::Server#add_success`, an exception raised within `Sidekiq::Batch::Server#needs_complete?` or `Sidekiq::Batch::Server#enqueue_callback` (at lines 60 and 62 for batch completion; I didn't test the calls for batch success) will cause the pending count to decrement to 0, the failed count to increment to 1, no callbacks to be queued, and the job to be put in the retry set. When the job retries successfully, it will decrement the pending count to -1, decrement the failed count to 0, and again no callbacks will be queued.

I'm manually simulating these exceptions by swapping the batch middleware out for a custom one that is identical except that it raises a generic exception at whatever point I want. In practice, I'm assuming we occasionally get timeouts on these second/third Redis calls, which forces us into this state. Hopefully I described it well enough for you to replicate (it's late and I can't be certain I'm making sense, but I did the tests earlier while I was quite sane).

I have a feeling the solution is to write a Lua script that handles all of the Redis calls within `Sidekiq::Batch::Server#add_success` and `Sidekiq::Batch::Server#add_failure`. The main issue with that is replicating the `Sidekiq::Client.push` call in Lua. It doesn't seem like this would significantly impact performance, since we're already serializing 11 Redis calls with `multi`. Then again, perhaps we can come up with a non-Lua way to handle this. Anyway, let me know what you think.