Occasionally stale server selection

netshade commented 11 years ago

We've been testing failover and while, by and large, the server selection works as expected, occasionally we see failover not selecting the right master server even after a node manager has correctly broadcast the changes to our client.

Taking a look at the code, it seems like the conditional server selection here (where presence in the stack overrides master selection if :master_only is supplied):

https://github.com/ryanlecompte/redis_failover/blob/master/lib/redis_failover/client.rb#L458

coupled with .pop usage on the thread local client stack here:

https://github.com/ryanlecompte/redis_failover/blob/master/lib/redis_failover/client.rb#L475

might be causing it. We replaced the .pop usage with .clear and emptied the stack entirely when the method was called, and we stopped seeing incorrect client selection. Does this seem right to you? Is that actually a problem?

ryanlecompte commented 11 years ago

Hmm, looking at the code I don't see how this could be a problem. The basic idea is that we should always #pop as many times as #client_for is called. Is there a particular redis call that's causing this for you? Is it happening in a simple call like #get or #set or in more complicated/nested calls like in #multi?

netshade commented 11 years ago

Loops on #incr, that's the only command. The client is initialized with :master_only => true . Setup was three node managers, one zookeeper, three instances of redis. Test script looked like:

loop do
 expected += 1
 actual = $client.incr("key")
 if expected != actual
  abort
  end
end

The comparison stuff worked as expected, but in the course of running that script while killing random servers, the client logger would report it had received the new master, but would immediately thereafter attempt to connect to the old master and write, either causing connection errors or read-only errors.

( Ruby 1.9.3 )

ryanlecompte / redis_failover

Occasionally stale server selection #42