I believe the issue may be in this Lua function:
local function canRun(lockKey, maxConcurrency)
  local activeJobs = tonumber(redis.call('get', lockKey))
  if (not maxConcurrency or maxConcurrency == 0) or (not activeJobs or activeJobs < maxConcurrency) then
    -- default case: maxConcurrency not defined or set to 0 means no cap on concurrent jobs OR
    -- maxConcurrency set, but lock does not yet exist OR
    -- maxConcurrency set, lock is set, but not yet at max concurrency
    return true
  else
    -- we are at max capacity for running jobs
    return false
  end
end
I am not a Redis expert by any means, but I think what might be happening is that the get on lockKey is being executed in parallel, and all 4 of them get a 0, which allows all 4 jobs to be executed. From what I can see in the Redis documentation, get does not block other readers.
Further research suggests that a proper locking implementation needs to be in place for this to work. Out of n processes and m workers, only 1 of them should be allowed to read/write the lock information at a time; otherwise we get race conditions.
A possible solution might be the Redlock Algorithm as described on the Redis website.
I'd appreciate any feedback on this. I am not sure if I am completely off track here.
One critical piece you seem to be missing is that Lua scripts are atomic.
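For illustration, since a script runs as one atomic unit on the Redis server, a read-check-write like the one below cannot interleave with another client running the same script. This is only a minimal sketch under that assumption, not the library's actual code:
-- minimal sketch: the GET and the INCR execute as a single atomic unit,
-- so two worker pools can never both observe the same count and both increment
local current = tonumber(redis.call('get', KEYS[1])) or 0
local maxConcurrency = tonumber(ARGV[1])
if current < maxConcurrency then
  redis.call('incr', KEYS[1])
  return 1 -- slot acquired
end
return 0 -- at capacity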
Yep, that'd be a pretty critical piece. Good to know. Thanks for the link!
Given that, I'm not sure what could be going on. Perhaps the lock data in redis got into a bad state when I upgraded the version of gowork, or some other weird thing. I am going to try spinning down all my servers and deleting the lock keys, then spinning everything up fresh.
If that doesn't work, I'm not sure what to do next to debug this further.
Finally had a bit of time this morning to look at this again. After further examination, I believe the issue lies here:
if haveJobs(jobQueue) and not isPaused(pauseKey) and canRun(lockKey, maxConcurrency) then
  acquireLock(lockKey, lockInfoKey, workerPoolID)
  res = redis.call('rpoplpush', jobQueue, inProgQueue)
  return {res, jobQueue, inProgQueue}
end
Specifically, the acquireLock call. canRun returns true if the maxConcurrency value is 0 or unset, which triggers acquireLock to be called. However, if the worker is then updated to have a MaxConcurrency value set, and it is less than whatever is currently stored in redis for the lock value (in my case, 1 requested, 4 already set), the worker freezes because it believes there are locks in place when, in fact, there are not and should not be.
I believe the solution is twofold:
1. Do not acquire the lock at all when maxConcurrency is unset or 0.
2. Clamp the lock value so that lockKey >= 0 && lockKey <= maxConcurrency.
The second entry above is necessary to patch any running processes that may be affected by this issue. The other option is to have people go in and manually delete lock keys.
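A hypothetical repair script for that second point might look roughly like this; the names are placeholders, not code from the library:
-- hypothetical repair sketch: clamp the lock counter into [0, maxConcurrency];
-- a maxConcurrency of 0 means "no cap", so in that case only a negative counter is reset
local function clampLock(lockKey, maxConcurrency)
  local activeJobs = tonumber(redis.call('get', lockKey)) or 0
  if activeJobs < 0 then
    redis.call('set', lockKey, 0)
  elseif maxConcurrency > 0 and activeJobs > maxConcurrency then
    redis.call('set', lockKey, maxConcurrency)
  end
end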
@tylerb this most recent commit should delegate lock cleanup to the reaper, which runs every ~10 min (see this lua script). Can you confirm that you're still facing issues even with these latest reaper changes?
@shdunning I didn't know that functionality was in there. Thanks for the info!
I didn't let my code run with MaxConcurrency on for 10 minutes as I can't afford to let my queue not process events for 10 minutes. However, I have a staging environment where I could test this if necessary.
Regardless, I can't set MaxConcurrency on my production environment, even if this does solve my problem, because I can't wait 10 minutes for my queues to start working again. Would it be possible to run the reaper at startup to ensure everything is in a sane state?
In addition, looking at this, it seems the reaper looks for dead processes and decrements the lock value by whatever value is stored in the hash for that specific process. In my testing, I've seen the lock value go negative in the hash. Given this, I feel it would be a good idea to either:
@tylerb multiple items to address, bear with me.
RE: reap cycles: The reaper does run immediately at startup, but unfortunately in most CI/CD environments the first reap cycle ends up being a no-op because the time since the dead worker pool's last heartbeat is less than the deadTime. Since the heartbeats are refreshed every 5 seconds, we've discussed adjusting the deadTime to something substantially less than 5 minutes for determining whether or not a worker pool is dead, which would increase the chances of successful reaping on that first reap cycle. I think this might be our best path forward to help avoid potential 10 minute deadlocks in your env.
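For reference, the liveness check being described boils down to something like the sketch below; the key and field names here are assumptions, not the library's actual schema:
-- hypothetical liveness check: a worker pool only counts as dead (and thus
-- reapable) once its last heartbeat is older than deadTime
local heartbeatAt = tonumber(redis.call('hget', KEYS[1], 'heartbeat_at'))
local now = tonumber(ARGV[1])
local deadTime = tonumber(ARGV[2])
if (not heartbeatAt) or (now - heartbeatAt) > deadTime then
  return 1 -- pool looks dead; its locks can be cleaned up
end
return 0 -- pool heartbeated recently; skip it this reap cycle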
RE: negative lock values: We noticed that too in our environments and PR #52 was designed to address the issue. We've been running with these changes since 6/22 (and tested in our staging env weeks before that) and have no longer observed the negative lock values. If you do continue to see them, then we've more work to do!
RE: your 2 proposed changes above:
In general I'm not a huge fan of just correcting bad lock state on-the-fly. I'd rather understand the situations that are getting us into the bad state (like a dirty shutdown) and make the software more resilient to them.
Thanks for the diligence here -- we'll work through this!
Thanks for your time and consideration on all this, @shdunning. Much appreciated. I am certainly new to all this stuff, so I appreciate your explanation and discussion.
Reap cycles: Adjusting the deadTime does sound like a much better approach. In addition, adding a delay to the first reap after startup may also prove useful. Perhaps that would give dead workers a chance to expire.
Negative lock values: I got them immediately after setting MaxConcurrency:1 and pushing the new binary. I'm not entirely sure how it happened, unfortunately. Prior to that, there was no MaxConcurrency set, and I had 2 processes with 2 workers each. Perhaps I had some bad data in my lock keys for the job prior to pushing?
As for my two proposed changes:
1. When the reaper cleans up a dead process, it calls decrby with whatever value is stored in the hash for that process; if that value is negative, the "decrement" will actually increment the lock value (see the sketch below). This will either cause the lock value to go > max concurrency or simply have locks taken for which there are no jobs. This is assuming, of course, that there are negative values in the hash.
2. If MaxConcurrency is then set to a value lower than the current lock value in redis, we are now deadlocked. I have seen this happen in my environment.
I do not know exactly how it got into that state. I would assume that my jobs would terminate and decrement the lock value as part of a graceful shutdown. Perhaps heroku hard killed the binary before jobs completed, leaving the lock value in a bad state.
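To make the first point concrete, DECRBY with a negative amount adds to the counter; the key name below is just a placeholder:
-- sketch of the failure mode: a negative per-pool count turns the reaper's
-- "decrement" into an increment
redis.call('set', 'myjob:lock', 1)
redis.call('decrby', 'myjob:lock', -4) -- the counter goes up to 5 instead of down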
Please let me know what else I can do to assist with this. Thanks again!
Come to think of it, I know heroku has killed my worker processes many, many times in the past due to the bug I fixed in my previous PR: #54
As my redis data has never been reset, it's certainly possible my lock values were all kinds of screwed up.
@tylerb yeah, unfortunately there was a bug in the reaper when MaxConcurrency was introduced that can leave locks in a bad state prior to the newest changes. Provided the old worker pool ids are still in the worker pools set, then the reaper should fix things back up. But as you mentioned, you can't wait 10 min for that 2nd reap cycle... let me put together a branch with some changes to make sure that first reap cycle runs. Maybe you can test with that?
More soon.
Fixed by #58.
This is going to be vague, as I'm currently trying to debug what's going on, but the MaxConcurrency option seems to be broken.
According to my current understanding: the :lock key for a job, under load, should be the same as the value of MaxConcurrency, but mine, for example, stays at 0. The :lock_info key should have individual lock counts, per worker, that add up to the total of :lock for that job. However, the contents of that hash vary wildly between -4 and 4. I am testing with a MaxConcurrency of 1.
It seems there is a race condition in the increment/decrement logic. I have a worker pool size of 4, so the -4 and 4 variance could come from there. Perhaps the system is attempting to pull 4 jobs at once, and all 4 succeed in incrementing/decrementing?
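As a way of spotting that drift, a diagnostic script along these lines (key names are placeholders) compares the :lock counter against the sum of the per-pool counts in :lock_info, which should always agree:
-- diagnostic sketch: the :lock counter should equal the sum of the
-- per-worker-pool counts held in the :lock_info hash
local lockVal = tonumber(redis.call('get', KEYS[1])) or 0 -- e.g. <ns>:jobs:<job>:lock
local total = 0
for _, c in ipairs(redis.call('hvals', KEYS[2])) do       -- e.g. <ns>:jobs:<job>:lock_info
  total = total + tonumber(c)
end
return {lockVal, total, (lockVal == total) and 1 or 0}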