royaltm opened this issue 11 years ago
Excellent catch, @royaltm!
So it seems that if we change this logic:
B: get > now ? NO (get is 0)
to this:
B: get (is nil or) > now ? YES (get is nil)
the problem would be solved?
This should solve the issue: https://github.com/kenn/redis-mutex/pull/6
Hello! Your patch solves the issue in this particular example. However, it's only a quick trick to work around the problem. The real problem with the redis-mutex pattern is that its read-check-write operations are not atomic, so they are inherently prone to race conditions.
I've just imagined another example that leads to a much more harmful situation:
now = 2
A has locked the resource but its key has already expired (with expire_at=1)
B tries to lock
B: setnx(3) ? NO
B: get > now ? NO (get is 1)
A tries to unlock
A: get == 1 ? YES
B: getset(3) <= now ? YES
A: del
At the end B seems to be the owner of the lock, but in fact the key has been deleted by A. B has acquired the lock on a premature assumption. This is much worse, because B is now fiddling in an unprotected critical section.
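The interleaving above can be replayed deterministically. This is a hypothetical sketch with a plain Ruby Hash standing in for Redis; only the setnx/get/getset/del semantics used in the walkthrough are mimicked:

```ruby
# Replay of the race: A's lock has expired, A unlocks while B tries to
# take over the expired lock via getset.
store = { "lock" => "1" }                 # A's lock, already expired (expire_at=1)
now = 2

b_setnx_ok = !store.key?("lock")          # B: setnx(3) ? NO, key exists
b_get = store["lock"].to_f                # B: get -> 1, not > now
a_owns = store["lock"] == "1"             # A: get == 1 ? YES, A proceeds to unlock
old = store["lock"]; store["lock"] = "3"  # B: getset(3) -> "1"
b_owns = old.to_f <= now                  # YES: B believes it acquired the lock
store.delete("lock")                      # A: del -- B's "lock" is silently gone
```

B now runs its critical section while no key protects it; the next client's setnx will succeed immediately.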
Of course you might still want to go down that road, adjusting this algorithm forever and hoping that no other race condition pattern emerges.
However, if you want redis-mutex to be truly robust, IMHO you should rethink the locking pattern itself.
I would start with changing unlock in the following manner:
WATCH key
GET key == me ?
YES: MULTI
     DEL key
     EXEC
NO: UNWATCH
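A sketch of that WATCH-based unlock against the redis-rb API. `FakeRedis` is a tiny single-threaded stand-in (so the example runs without a server, and WATCH never aborts here); `my_token` is the expiry value this client wrote when it took the lock:

```ruby
# Minimal in-memory stand-in for the redis-rb calls used below.
class FakeRedis
  def initialize(store); @store = store; end
  def get(key); @store[key]; end
  def del(key); @store.delete(key) ? 1 : 0; end
  def watch(_key); yield; end        # real redis-rb yields under a monitor
  def unwatch; "OK"; end
  def multi; yield self; [1]; end    # pretend EXEC committed
end

# Delete the lock only if we still own it, and only if nobody touched
# the key between WATCH and EXEC.
def safe_unlock(redis, key, my_token)
  redis.watch(key) do
    if redis.get(key) == my_token
      # redis-rb returns nil from MULTI when WATCH aborted the transaction.
      !redis.multi { |m| m.del(key) }.nil?
    else
      redis.unwatch
      false                          # not our lock (or already gone)
    end
  end
end
```

For example, `safe_unlock(FakeRedis.new("mutex" => "42"), "mutex", "42")` deletes the key, while a stale token leaves it untouched.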
Please look here. This one has a well-designed locking pattern.
The unlock test is indeed racy; that's another issue. The watch solves it, yes. How about a Lua script to avoid round trips? https://github.com/crowdtap/promiscuous/blob/master/lib/promiscuous/redis.rb#L132-L151
Scrap that, there's still a race on the getset (@expires_at won't be the same). I guess I'll go with a Lua script for both sides. (I need to take multiple locks at the same time.)
https://github.com/crowdtap/promiscuous/blob/master/lib/promiscuous/redis.rb#L116-L154 lock/unlock is now a LUA script. Much easier, and probably faster (only one network roundtrip for each operation).
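For the record, the single-round-trip unlock mentioned here boils down to the generic compare-and-delete script from the official Redis docs (this is not promiscuous's actual code, and it requires Redis >= 2.6):

```ruby
# Compare-and-delete unlock as one Lua script: the key is deleted only
# if it still holds our token, atomically, in a single round trip.
UNLOCK_SCRIPT = <<~LUA
  if redis.call("get", KEYS[1]) == ARGV[1] then
    return redis.call("del", KEYS[1])
  else
    return 0
  end
LUA

# Usage with redis-rb (assumes a live connection; `lock_key` and
# `my_token` are hypothetical names):
#   redis.eval(UNLOCK_SCRIPT, keys: [lock_key], argv: [my_token])
```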
For multiple locks you might take a look at my gem (EM only): https://github.com/royaltm/redis-em-mutex
I don't like EM, I like Celluloid, and I need something thread safe. Quite honestly, this code is unreadable: https://github.com/royaltm/redis-em-mutex/blob/master/lib/redis/em-mutex/script_handler.rb#L162-L213 way too much complexity for me to verify that this is working as intended. But thanks :)
probably depends on who's reading it :), but no problem
A number of projects are still using Redis 2.4, where Lua scripting is not supported. I would go with optimistic locking via WATCH at this time. The Lua-based logic would be a nice addition for Redis 2.6, though.
I'm going to change this code in try_lock
return false if get.to_f > now
to this
return false if (old_value = get).nil? or old_value.to_f > now
and fix the first case.
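The reason the extra nil check matters: in Ruby, `nil.to_f` is `0.0`, so with the old guard a key deleted between `setnx` and `get` reads back as 0.0 and slips past the comparison straight into `getset`. A minimal demonstration:

```ruby
now = Time.now.to_f
get = nil                         # key was deleted by another client in the meantime

# Old guard: nil.to_f == 0.0, so the comparison is false and
# execution would wrongly continue to getset.
old_guard = get.to_f > now

# New guard: bail out early when the key is gone.
new_guard = (old_value = get).nil? || old_value.to_f > now
```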
Then for unlock, I'll try implementing the WATCH command. How does that sound?
That sounds good, but you are still screwed, because you can getset() a lock that you don't own. And so the unlock check doesn't work on a lock you've recovered while someone was racing with you.
The point of this fix is that you never get to getset, as it returns false immediately. In other words, the get checks two things: whether the lock is still effective (get.to_f > now) and whether anyone released the lock in the middle (get.nil?). Am I still missing something?
After that, you still need getset to acquire an expired lock. If you raced with someone while doing the getset, unlock is broken, because the lock won't contain @expires_at.
A proposition for the locking pattern:
until setnx              # loop until setnx succeeds: the only place where we can be sure the key belongs to us
  watch do
    if (old_value = get).nil?
      unwatch            # the key is gone, probably free now; try setnx again
      next
    end
    if old_value.to_f > now
      unwatch
      return false       # someone else holds a valid lock, leave now
    else                 # old_value <= now: possibly expired
      # try to delete the expired lock atomically
      return false unless multi { |m| m.del }  # someone else has messed with this key, leave now
      # deleted: try setnx again
    end
  end
end
return true
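A hedged Ruby rendition of the pattern above, again with a minimal in-memory stand-in (single-threaded, so WATCH never aborts) so it runs without a server; the method names follow redis-rb conventions, and `key`, `expires_at`, `now` are illustrative parameters:

```ruby
class FakeRedis
  def initialize(store = {}); @store = store; end
  def setnx(key, value)
    return false if @store.key?(key)
    @store[key] = value.to_s
    true
  end
  def get(key); @store[key]; end
  def del(key); @store.delete(key); end
  def watch(_key); yield; end      # real redis-rb runs this under a monitor
  def unwatch; "OK"; end
  def multi; yield self; []; end   # pretend EXEC committed (nil would mean aborted)
end

def try_lock(redis, key, expires_at, now)
  until redis.setnx(key, expires_at)      # only a successful setnx proves ownership
    redis.watch(key) do
      old_value = redis.get(key)
      if old_value.nil?
        redis.unwatch                     # key vanished, retry setnx
      elsif old_value.to_f > now
        redis.unwatch
        return false                      # someone holds a valid lock, leave now
      else
        # expired: delete it inside MULTI; nil means someone touched the key
        return false if redis.multi { |m| m.del(key) }.nil?
      end
    end
  end
  true
end
```

With a free key the setnx succeeds at once; with a valid lock the call returns false; with an expired lock the key is deleted under WATCH and then re-acquired via setnx.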
Together with the unlock above, this will solve the problem.
@royaltm :+1:
One caveat, however: how does watch work when multiple threads are using the same Redis connection?
RTFS https://github.com/redis/redis-rb/blob/master/lib/redis.rb#L1995 - synchronize to the rescue
I did read the fucking sources.
Thread1: watch
Thread2: get XX
Thread1: get YY
Thread1 will have the "get" of thread2 in its watch. Point is, watch doesn't work well with threads and single redis connection.
The F stands for Fine ;) Why assume it "doesn't work well"? watch with a block is guarded by a monitor, so only one thread can watch and do its stuff inside at a time.
The synchronization that you get is this one:
Thread1: lock.synchronize { watch }
Thread2: lock.synchronize { get XX }
Thread1: lock.synchronize { get YY }
Thread1 did not expect to have XX watched. That's a problem.
Nope, I see it differently. The synchronization is actually this, as the yield happens inside the synchronize block in watch:
Thread1: lock.synchronize { watch do blablabla end }
Thread2: lock.synchronize { watch do blablabla end }
OHhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh ok. I misread the sources :D
don't misread your sources then :)
so :+1: for your solution, I don't see any issues with it.
The thing is: are watch/multi supported in Redis::Classy, and how? @kenn?
It's using method_missing, so it should be just fine. I don't like Classy though :)
Every man to his taste.
Yes
is this still an open issue or is it solved ?
I've created a PR for this at #27 - if anyone can review, I'd appreciate it.
The pattern from the official SETNX doc, used in @kenn's implementation, is flawed. The implementation itself, however, is really nice and I like it. Nevertheless, my bad habit of testing other people's ideas (especially before implementing them or using them in production) pushed me to play a little with the redis-mutex gem.
Unfortunately, the setnx/getset pattern may lead to what I call "phantom starvation" of two (or more) concurrent processes (or threads) when both try to enter the critical section at the wrong time.
Here's a simple stress test:
Now spawn two terminal windows (or create a split window in tmux) and run the above script in both of them. When you start the first one, you'll see counting. But after a short moment (or even right at the beginning) when you run the second one, the counting stops periodically. The scripts will eventually reach 9999 (if you are patient enough), but they spend most of their time sleeping, waiting for the lock to expire.
I suppose @kenn is already aware of this, but since there is no BUGS/LIMITATIONS section in the README file, I'm writing this issue for the record.
I also think it is our (the coders') responsibility to warn others about known issues and limitations, especially when you are the keeper of a quite popular rubygem name. Noblesse oblige (and popularity should too) in the open source crowd. :)
Here are the reconstructed steps behind the vulnerability:
For clarity there are three actors in the example (A, B, C) though C could also be A.
now = 1
A has locked the resource (with expire_at=2)
A tries to unlock; at the same time B tries to lock (with expire_at=3)
A: get == 2 ? YES
B: setnx(3) ? NO
A: del
B: get > now ? NO (get is 0)
C now tries to lock (with expire_at=4)
C: setnx(4) ? YES -> C is the owner now
B: getset(3) <= now ? NO
at the end B is not the owner but has managed to alter the lock
Therefore C won't unlock: C: get == 4 ? NO (get is 3). So the lock remains stale and all the hungry processes must wait until it expires.
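The steps above can be replayed deterministically. This is a hypothetical sketch with a plain Ruby Hash standing in for Redis (only the setnx/get/getset/del semantics from the walkthrough are mimicked):

```ruby
# Replay of the A/B/C race: B loses the getset but still corrupts
# C's freshly acquired lock.
store = { "lock" => "2" }                 # A holds the lock (expire_at=2)
now = 1

a_owns     = store["lock"] == "2"         # A: get == 2 ? YES, A proceeds to unlock
b_setnx_ok = !store.key?("lock")          # B: setnx(3) ? NO, key still exists
store.delete("lock")                      # A: del
b_get = store["lock"].to_f                # B: get -> 0.0, not > now
store["lock"] = "4" unless store.key?("lock")  # C: setnx(4) ? YES, C is the owner
old = store["lock"]; store["lock"] = "3"  # B: getset(3) -> "4"
b_owns = old.to_f <= now                  # NO: B is not the owner...
c_unlocks = store["lock"] == "4"          # ...but C: get == 4 ? NO (it is "3")
```

B correctly concludes it did not acquire the lock, yet its getset has already overwritten C's value, so C can never pass its own unlock check and the key lingers until it expires.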