I believe the issue may be in this Lua function:
local function canRun(lockKey, maxConcurrency)
  local activeJobs = tonumber(redis.call('get', lockKey))
  if (not maxConcurrency or maxConcurrency == 0) or (not activeJobs or activeJobs < maxConcurrency) then
    -- default case: maxConcurrency not defined or set to 0 means no cap on concurrent jobs OR
    -- maxConcurrency set, but lock does not yet exist OR
    -- maxConcurrency set, lock is set, but not yet at max concurrency
    return true
  else
    -- we are at max capacity for running jobs
    return false
  end
end
I am not a Redis expert by any means, but I think what might be happening is that the get on lockKey is being executed in parallel, and all 4 of them get a 0, which allows all 4 jobs to be executed. From what I can see in the Redis documentation, get does not block other readers.
Further research suggests that a proper locking implementation needs to be in place for this to work. Out of n processes and m workers, only 1 of them should be allowed to read/write the lock information at a time; otherwise we get race conditions.
A possible solution might be the Redlock Algorithm as described on the Redis website.
I'd appreciate any feedback on this. I am not sure if I am completely off track here.
One critical piece you seem to be missing is that Lua scripts are atomic.
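For illustration, since a script runs as one atomic unit on the Redis server, a read-check-write like the one below cannot interleave with another client running the same script. This is only a minimal sketch under that assumption, not the library's actual code:
-- minimal sketch: the GET and the INCR execute as a single atomic unit,
-- so two worker pools can never both observe the same count and both increment
local current = tonumber(redis.call('get', KEYS[1])) or 0
local maxConcurrency = tonumber(ARGV[1])
if current < maxConcurrency then
  redis.call('incr', KEYS[1])
  return 1 -- slot acquired
end
return 0 -- at capacity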
Yep, that'd be a pretty critical piece. Good to know. Thanks for the link!
Given that, I'm not sure what could be going on. Perhaps the lock data in redis got into a bad state when I upgraded the version of gowork, or some other weird thing. I am going to try spinning down all my servers and deleting the lock keys, then spinning everything up fresh.
If that doesn't work, I'm not sure what to do next to debug this further.
Finally had a bit of time this morning to look at this again. After further examination, I believe the issue lies here:
if haveJobs(jobQueue) and not isPaused(pauseKey) and canRun(lockKey, maxConcurrency) then
  acquireLock(lockKey, lockInfoKey, workerPoolID)
  res = redis.call('rpoplpush', jobQueue, inProgQueue)
  return {res, jobQueue, inProgQueue}
end
Specifically, the acquireLock call. canRun returns true if the maxConcurrency value is 0 or unset, which triggers acquireLock to be called. However, if the worker is then updated to have a MaxConcurrency value set, and it is less than whatever is currently stored in redis for the lock value (in my case, 1 requested, 4 already set), the worker freezes because it believes there are locks in place when, in fact, there are not and should not be.
I believe the solution is twofold:
1. Do not acquire the lock at all when maxConcurrency is unset or 0.
2. Clamp the lock value so that lockKey >= 0 && lockKey <= maxConcurrency.
The second entry above is necessary to patch any running processes that may be affected by this issue. The other option is to have people go in and manually delete lock keys.
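A hypothetical repair script for that second point might look roughly like this; the names are placeholders, not code from the library:
-- hypothetical repair sketch: clamp the lock counter into [0, maxConcurrency];
-- a maxConcurrency of 0 means "no cap", so in that case only a negative counter is reset
local function clampLock(lockKey, maxConcurrency)
  local activeJobs = tonumber(redis.call('get', lockKey)) or 0
  if activeJobs < 0 then
    redis.call('set', lockKey, 0)
  elseif maxConcurrency > 0 and activeJobs > maxConcurrency then
    redis.call('set', lockKey, maxConcurrency)
  end
end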
@tylerb this most recent commit should delegate lock cleanup to the reaper, which runs every ~10 min (see this lua script). Can you confirm that you're still facing issues even with these latest reaper changes?
@shdunning I didn't know that functionality was in there. Thanks for the info!
I didn't let my code run with MaxConcurrency on for 10 minutes as I can't afford to let my queue not process events for 10 minutes. However, I have a staging environment where I could test this if necessary.
Regardless, I can't set MaxConcurrency on my production environment, even if this does solve my problem, because I can't wait 10 minutes for my queues to start working again. Would it be possible to run the reaper at startup to ensure everything is in a sane state?
In addition, looking at this, it seems the reaper looks for dead processes and decrements the lock value by whatever value is stored in the hash for that specific process. In my testing, I've seen the lock value go negative in the hash. Given this, I feel it would be a good idea to either:
@tylerb multiple items to address, bear with me.
RE: reap cycles: The reaper does run immediately at startup, but unfortunately in most CI/CD environments the first reap cycle ends up being a no-op because the time since the dead worker pool's last heartbeat is less than the deadTime. Since the heartbeats are refreshed every 5 seconds, we've discussed adjusting the deadTime to something substantially less than 5 minutes for determining whether or not a worker pool is dead, which would increase the chances of successful reaping on that first reap cycle. I think this might be our best path forward to help avoid potential 10 minute deadlocks in your env.
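For reference, the liveness check being described boils down to something like the sketch below; the key and field names here are assumptions, not the library's actual schema:
-- hypothetical liveness check: a worker pool only counts as dead (and thus
-- reapable) once its last heartbeat is older than deadTime
local heartbeatAt = tonumber(redis.call('hget', KEYS[1], 'heartbeat_at'))
local now = tonumber(ARGV[1])
local deadTime = tonumber(ARGV[2])
if (not heartbeatAt) or (now - heartbeatAt) > deadTime then
  return 1 -- pool looks dead; its locks can be cleaned up
end
return 0 -- pool heartbeated recently; skip it this reap cycle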
RE: negative lock values: We noticed that too in our environments and PR #52 was designed to address the issue. We've been running with these changes since 6/22 (and tested in our staging env weeks before that) and have no longer observed the negative lock values. If you do continue to see them, then we've more work to do!
RE: your 2 proposed changes above:
In general I'm not a huge fan of just correcting bad lock state on-the-fly. I'd rather understand the situations that are getting us into the bad state (like a dirty shutdown) and make the software more resilient to them.
Thanks for the diligence here -- we'll work through this!
Thanks for your time and consideration on all this, @shdunning. Much appreciated. I am certainly new to all this stuff, so I appreciate your explanation and discussion.
Reap cycles: Adjusting the deadTime does sound like a much better approach. In addition, adding a delay to the first reap after startup may also prove useful. Perhaps that would give dead workers a chance to expire.
Negative lock values: I got them immediately after setting MaxConcurrency:1 and pushing the new binary. I'm not entirely sure how it happened, unfortunately. Prior to that, there was no MaxConcurrency set, and I had 2 processes with 2 workers each. Perhaps I had some bad data in my lock keys for the job prior to pushing?
As for my two proposed changes:
1. When the reaper cleans up a dead process, it calls decrby with whatever value is stored in the hash for that process; if that value is negative, the "decrement" will actually increment the lock value (see the sketch below). This will either cause the lock value to go > max concurrency or simply have locks taken for which there are no jobs. This is assuming, of course, that there are negative values in the hash.
2. If MaxConcurrency is then set to a value lower than the current lock value in redis, we are now deadlocked. I have seen this happen in my environment.
I do not know exactly how it got into that state. I would assume that my jobs would terminate and decrement the lock value as part of a graceful shutdown. Perhaps heroku hard killed the binary before jobs completed, leaving the lock value in a bad state.
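To make the first point concrete, DECRBY with a negative amount adds to the counter; the key name below is just a placeholder:
-- sketch of the failure mode: a negative per-pool count turns the reaper's
-- "decrement" into an increment
redis.call('set', 'myjob:lock', 1)
redis.call('decrby', 'myjob:lock', -4) -- the counter goes up to 5 instead of down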
Please let me know what else I can do to assist with this. Thanks again!
Come to think of it, I know heroku has killed my worker processes many, many times in the past due to the bug I fixed in my previous PR: #54
As my redis data has never been reset, it's certainly possible my lock values were all kinds of screwed up.
@tylerb yeah, unfortunately there was a bug in the reaper when MaxConcurrency was introduced that can leave locks in a bad state prior to the newest changes. Provided the old worker pool ids are still in the worker pools set, then the reaper should fix things back up. But as you mentioned, you can't wait 10 min for that 2nd reap cycle... let me put together a branch with some changes to make sure that first reap cycle runs. Maybe you can test with that?
More soon.
Fixed by #58.
This is going to be vague, as I'm currently trying to debug what's going on, but the MaxConcurrency option seems to be broken.
According to my current understanding: the :lock key for a job, under load, should be the same as the value of MaxConcurrency, but mine, for example, stays at 0. The :lock_info key should have individual lock counts, per worker, that add up to the total of :lock for that job. However, the contents of that hash vary wildly between -4 and 4. I am testing with a MaxConcurrency of 1.
It seems there is a race condition in the increment/decrement logic. I have a worker pool size of 4, so the -4 and 4 variance could come from there. Perhaps the system is attempting to pull 4 jobs at once, and all 4 succeed in incrementing/decrementing?
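As a way of spotting that drift, a diagnostic script along these lines (key names are placeholders) compares the :lock counter against the sum of the per-pool counts in :lock_info, which should always agree:
-- diagnostic sketch: the :lock counter should equal the sum of the
-- per-worker-pool counts held in the :lock_info hash
local lockVal = tonumber(redis.call('get', KEYS[1])) or 0 -- e.g. <ns>:jobs:<job>:lock
local total = 0
for _, c in ipairs(redis.call('hvals', KEYS[2])) do       -- e.g. <ns>:jobs:<job>:lock_info
  total = total + tonumber(c)
end
return {lockVal, total, (lockVal == total) and 1 or 0}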