kenn / redis-mutex

Distributed mutex using Redis
https://rubygems.org/gems/redis-mutex
MIT License

known pattern flow #5

Open royaltm opened 11 years ago

royaltm commented 11 years ago

The pattern from the official SETNX doc used in @kenn's implementation is flawed. The implementation itself, however, is really nice and I like it. Nevertheless, my bad habit of testing other people's ideas (especially before implementing them and using them in production) pushed me to play a little bit with the redis-mutex gem.

Unfortunately the setnx/getset pattern may lead to what I call "phantom starvation" of two (or more) concurrent processes (or threads) when both try to enter the critical section at the wrong time.

Here's a simple stress test:

require 'redis-mutex'

REDIS_OPTS = {}

lck = '__STRESSTEST__.lck'
Redis::Classy.db = Redis.new(REDIS_OPTS)
rmutex = Redis::Mutex.new(lck, expire: 5, block: 10000, sleep: 0.01)
10000.times do |i|
  rmutex.with_lock do
    print "\r %6d" % i
    sleep rand/100
  end
end

Now spawn two terminal windows (or create a split window in tmux) and run the above script in both of them. When you start the first one, you'll see the counter advancing. But in a short moment (or even right at the beginning) after you start the second one, the counting stops periodically. The scripts will eventually reach 9999 (if you are patient enough), but they spend most of the time sleeping, waiting for the lock to expire.

I suppose @kenn may already be aware of this, but since there is no BUGS/LIMITATIONS section in the README file I wrote this issue for the record.

I also think it is our (the coders') responsibility to warn others about known issues and limitations, especially when you are the keeper of a quite popular rubygem name. Noblesse oblige (and popularity should, too) in the open source crowd. :)

Here are the reconstructed steps behind the vulnerability:

For clarity there are three actors in the example (A, B, C) though C could also be A.

now = 1
A has locked the resource (with expire_at=2)
A tries to unlock; at the same time B tries to lock (with expire_at=3)

A: get == 2 ?          YES
B: setnx(3) ?          NO
A: del
B: get > now ?         NO (get is 0)
C now tries to lock (with expire_at=4)
C: setnx(4) ?          YES -> C is the owner now
B: getset(3) <= now ?  NO

At the end B is not the owner but has managed to alter the lock. Therefore C won't unlock:

C: get == 4 ? NO (get is 3)

So the lock remains stale and all the hungry processes must wait until it expires.
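The interleaving above can be replayed with a tiny in-memory stand-in for SETNX/GETSET (a hypothetical simulation, not the gem's code), showing how B's getset clobbers C's value so that C's unlock guard can never match:

```ruby
# Hypothetical in-memory stand-ins for the Redis commands used by the
# SETNX/GETSET pattern; real Redis behaves the same for this sequence.
store  = {}
setnx  = ->(k, v) { store.key?(k) ? false : (store[k] = v; true) }
getset = ->(k, v) { old = store[k]; store[k] = v; old }

key = 'lock'
now = 1
store[key] = 2                 # A holds the lock (expire_at = 2)

b_ok = setnx.call(key, 3)      # B: setnx(3) ? NO  (A still holds it)
store.delete(key)              # A: get == 2 ? YES, so A: del
                               # B: get > now ? NO (key is gone now)
c_ok = setnx.call(key, 4)      # C: setnx(4) ? YES -> C is the owner
b_old = getset.call(key, 3)    # B: getset(3) returns 4; 4 <= now ? NO

# B is not the owner, yet the key now holds B's 3 instead of C's 4, so
# C's unlock check (get == 4) will fail and the key lingers until expiry.
```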

kenn commented 11 years ago

Excellent catch, @royaltm!

So it seems that if we change this logic:

B: get > now ? NO (get is 0)

to this:

B: get (is nil or) > now ? YES (get is nil)

the problem would be solved?

nviennot commented 11 years ago

This should solve the issue: https://github.com/kenn/redis-mutex/pull/6

royaltm commented 11 years ago

Hello! Your patch solves the issue in this particular example. However, it's only a petty trick to work around the problem. The fundamental problem with the redis-mutex pattern is that the read - check - write operations are not atomic. They are always prone to race conditions.

I've just imagined another example that leads to a much more harmful situation:

now = 2
A has locked the resource but its key value has already expired (with expire_at=1)
B tries to lock

B: setnx(3) ?          NO
B: get > now ?         NO (get is 1)
A tries to unlock
A: get == 1 ?          YES
B: getset(3) <= now ?  YES
A: del

At the end B seems to be the owner of the lock, but in reality the key has been deleted by A. B has acquired the lock on a premature assumption. It's much worse, because B is now fiddling in an unprotected critical section.
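The same kind of in-memory stand-in (again a hypothetical simulation) replays this second trace, ending with B convinced it holds a lock whose key A has just deleted:

```ruby
# Hypothetical stand-ins for SETNX/GETSET, as in the first trace.
store  = {}
setnx  = ->(k, v) { store.key?(k) ? false : (store[k] = v; true) }
getset = ->(k, v) { old = store[k]; store[k] = v; old }

key = 'lock'
now = 2
store[key] = 1                      # A's lock, already expired (expire_at = 1)

b_ok = setnx.call(key, 3)           # B: setnx(3) ? NO
                                    # B: get > now ? NO (1 <= 2), so B goes on
                                    # A: get == 1 ? YES, so A will del
b_ok = getset.call(key, 3) <= now   # B: getset(3) ? 1 <= 2 -> B "owns" it
store.delete(key)                   # A: del -- B's lock vanishes

# B now enters the critical section unprotected: the key no longer exists.
```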

Of course you might still want to go that way, adjusting this algorithm forever and hoping that no other race condition pattern emerges.

However, if you want redis-mutex to be truly robust, IMHO you should make the read - check - write sequence atomic.

I would start with changing unlock in the following manner:

WATCH key
GET key == me ?
YES: MULTI
     DEL key
     EXEC        # aborts (returns nil) if anyone touched the key since WATCH
NO:  UNWATCH

Please look here. This one has a well-designed locking pattern.
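The unlock sketch above can be fleshed out in Ruby. FakeRedis below is a hypothetical in-memory stand-in that mimics redis-rb's block-form watch and the MULTI/EXEC abort-on-change semantics, just to exercise the control flow; against a real server the same calls would go to redis-rb:

```ruby
# Hypothetical stand-in: models WATCH/MULTI compare-and-set semantics
# in memory so the unlock flow can be run without a server.
class FakeRedis
  def initialize(store)
    @store = store
  end

  def watch(key)                  # redis-rb's watch takes a block
    @watched_key = key
    @watched_val = @store[key]
    yield
  end

  def get(key)
    @store[key]
  end

  def del(key)
    @store.delete(key)
  end

  def unwatch
    @watched_key = nil
  end

  # MULTI/EXEC returns nil (aborts) if the watched key changed
  def multi
    return nil if @store[@watched_key] != @watched_val
    yield self
    true
  end
end

# Delete the key only if it still holds our token; abort otherwise.
def safe_unlock(redis, key, me)
  redis.watch(key) do
    if redis.get(key) == me
      !redis.multi { |m| m.del(key) }.nil?
    else
      redis.unwatch
      false                # not ours (expired and re-acquired elsewhere)
    end
  end
end
```

With this shape, an unlock racing with a takeover simply aborts instead of deleting someone else's lock.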

nviennot commented 11 years ago

The unlock test is indeed racy, it's another issue. The watch solves it yes. How about a lua script to avoid roundtrips? https://github.com/crowdtap/promiscuous/blob/master/lib/promiscuous/redis.rb#L132-L151

nviennot commented 11 years ago

Scrap that, it's racy with a race on the getset (@expires_at won't be the same). I guess I'll go with a Lua script for both sides. (I need to take multiple locks at the same time.)

nviennot commented 11 years ago

https://github.com/crowdtap/promiscuous/blob/master/lib/promiscuous/redis.rb#L116-L154 lock/unlock is now a Lua script. Much easier, and probably faster (only one network roundtrip per operation).
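A server-side script makes the compare-and-delete a single atomic step. The script below is the standard compare-token-then-DEL idiom (not necessarily promiscuous's exact script), with a small hypothetical Ruby mirror of its logic so the semantics can be checked without a server:

```ruby
# Classic atomic unlock in Lua: Redis runs scripts atomically, so there
# is no window between the GET and the DEL.
UNLOCK_LUA = <<-LUA
  if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
  else
    return 0
  end
LUA

# With redis-rb against a real server (Redis >= 2.6) this would be:
#   redis.eval(UNLOCK_LUA, keys: ['my_lock'], argv: [my_token])

# Hypothetical Ruby mirror of the script's logic, for illustration only:
def unlock_semantics(store, key, token)
  if store[key] == token
    store.delete(key)
    1
  else
    0
  end
end
```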

royaltm commented 11 years ago

for multiple locks you might take a look at my gem: (EM only) https://github.com/royaltm/redis-em-mutex

nviennot commented 11 years ago

I don't like EM, I like Celluloid, and I need something thread-safe. Quite honestly, this code is unreadable: https://github.com/royaltm/redis-em-mutex/blob/master/lib/redis/em-mutex/script_handler.rb#L162-L213 is way too much complexity for me to verify that it works as intended. But thanks :)

royaltm commented 11 years ago

probably depends on who's reading it :), but no problem

kenn commented 11 years ago

A number of projects are still using Redis 2.4, and Lua scripting is not supported there. I would go with optimistic locking with WATCH at this time. The Lua-based logic would be a nice addition for Redis 2.6, though.

I'm going to change this code in try_lock

return false  if get.to_f > now

to this

return false  if (old_value = get).nil? or old_value.to_f > now

and fix the first case.

Then for unlock, I'll try implementing the WATCH command. How does that sound?
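The amended flow might read as follows. FakeMutex is a hypothetical in-memory harness (not the gem's actual class) wrapping setnx/get/getset so the new nil check can be exercised:

```ruby
# Hypothetical harness standing in for Redis::Mutex with a Hash store.
class FakeMutex
  def initialize(store, key, expire)
    @store, @key, @expire = store, key, expire
  end

  def setnx(v)
    return false if @store.key?(@key)
    @store[@key] = v
    true
  end

  def get
    @store[@key]
  end

  def getset(v)
    old = @store[@key]
    @store[@key] = v
    old
  end

  def try_lock(now)
    @expires_at = now + @expire
    return true if setnx(@expires_at)         # free: we own it

    # Amended check: bail out if the lock vanished mid-flight (nil)
    # or is still held and unexpired (> now).
    old_value = get
    return false if old_value.nil? || old_value.to_f > now

    # Expired: try to take it over. Note this getset is still racy, as
    # the thread goes on to discuss.
    getset(@expires_at).to_f <= now
  end
end
```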

nviennot commented 11 years ago

That sounds good, but you are still screwed, because you can getset() a lock that you don't own. And so the unlock check doesn't work on a lock you've recovered while someone was racing with you.

kenn commented 11 years ago

The point of this fix is that you never get to getset, as it returns false immediately. In other words, the get checks two things - if the lock is still effective (get.to_f > now) or anyone released the lock in the middle (get.nil?). Am I still missing something?

nviennot commented 11 years ago

After that, you need getset to acquire an expired lock. If you raced with someone while doing the getset, unlock is broken, because the lock won't contain @expires_at.

royaltm commented 11 years ago

Proposition of the locking pattern:

until setnx # loop until it succeeds; this is the only place where we can be sure the locked key belongs to us
  watch do
    if (old_value = get).nil? # try again, it's probably free now
      unwatch
      next
    end
    if old_value.to_f > now
      unwatch
      return false # someone else holds a valid lock, leave now
    else # get <= now - possibly expired
      # now try to delete the expired lock
      return false unless multi {|m| m.del } # someone else has messed with this key, leave now
      # otherwise try again
    end
  end
end
return true

Together with the unlock above, this will solve the thing.

nviennot commented 11 years ago

@royaltm :+1:

nviennot commented 11 years ago

A caveat however, how does watch work when multiple threads are using the same redis connection?

royaltm commented 11 years ago

RTFS https://github.com/redis/redis-rb/blob/master/lib/redis.rb#L1995 - synchronize to the rescue

nviennot commented 11 years ago

I did read the fucking sources.

Thread1: watch
Thread2: get XX
Thread1: get YY

Thread1 will have Thread2's "get" in its watch. Point is, watch doesn't work well with threads sharing a single redis connection.

royaltm commented 11 years ago

F stands for Fine ;) "Doesn't work well" - why assume? watch with a block is secured with a monitor, so only one thread at a time will watch and do stuff inside.

nviennot commented 11 years ago

The synchronization that you get is this one:

Thread1: lock.synchronize { watch }
Thread2: lock.synchronize { get XX }
Thread1: lock.synchronize { get YY }

Thread1 did not expect to have XX watched. That's a problem.

royaltm commented 11 years ago

Nope, I see it differently. Since the yield happens inside the synchronize block in watch, the synchronization is actually this:

Thread1: lock.synchronize { watch do blablabla end }
Thread2: lock.synchronize { watch do blablabla end }
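The point can be sketched: redis-rb wraps the entire block-form watch call in one reentrant monitor, so another thread's commands cannot interleave between WATCH and the block body. FakeClient below is a hypothetical client mimicking that shape, not redis-rb itself:

```ruby
require 'monitor'

# Hypothetical client: the whole watch block runs under one (reentrant)
# monitor, so commands from other threads cannot slip in between the
# WATCH and the commands issued inside the block.
class FakeClient
  include MonitorMixin

  attr_reader :log

  def initialize
    super()
    @log = []
  end

  def watch(key)
    synchronize do
      @log << [:watch, key]
      yield
    end
  end

  def get(key)
    synchronize { @log << [:get, key] }
  end
end

c = FakeClient.new
c.watch('k') { c.get('k') }   # the get lands inside the watched span
```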

nviennot commented 11 years ago

OHhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh ok. I misread the sources :D

royaltm commented 11 years ago

don't misread your sources then :)

nviennot commented 11 years ago

so :+1: for your solution, I don't see any issues with it.

royaltm commented 11 years ago

The thing is, whether and how watch/multi are supported in Redis::Classy. @kenn?

nviennot commented 11 years ago

It's using method_missing, so it should be just fine. I don't like Classy though :)

royaltm commented 11 years ago

Every man to his taste.

nviennot commented 11 years ago

Yes

mkristian commented 7 years ago

is this still an open issue, or has it been solved?

kenn commented 7 years ago

I've created a PR for this at #27 - if anyone can review, I'd appreciate it.