ryanlecompte / redis_failover

redis_failover is a ZooKeeper-based automatic master/slave failover solution for Ruby.
http://github.com/ryanlecompte/redis_failover
MIT License
539 stars · 65 forks

Continuously swapping master redis_node_manager #37

Closed maxjustus closed 12 years ago

maxjustus commented 12 years ago

We're running 7 instances of redis node manager in production, one per app server. There was a connection timeout with ZooKeeper on a few clients (ZK::Exceptions::OperationTimeOut: inputs: {:path=>"/redis_failover_nodes"}) because we took down one of our app servers to move it to a different host. The loss of connection seemed to cause the master node manager role to start switching about once every 20 seconds with the error:

ZK::Exceptions::LockAssertionFailedError: we do not actually hold the lock

https://gist.github.com/3feb567ff0374be12757

Any ideas?
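For readers following along, here is a minimal, purely illustrative Ruby sketch of the cycle the logs describe: each node manager blocks on an exclusive ZooKeeper lock, keeps asserting that it still holds the lock while acting as primary, and steps down when the assertion fails after a connection hiccup, at which point the next manager wins the lock. FlakyLock and serve_as_primary are hypothetical stand-ins for the ZK gem's exclusive locker and the manager loop, not real redis_failover code.

```ruby
# Simulated error matching the one in the gist.
class LockAssertionFailedError < StandardError; end

# Stand-in for a ZK exclusive lock whose session dies after a few checks,
# mimicking a lock lost to a ZooKeeper connection timeout.
class FlakyLock
  def initialize(assertions_before_failure)
    @remaining = assertions_before_failure
  end

  def assert!
    @remaining -= 1
    raise LockAssertionFailedError, "we do not actually hold the lock" if @remaining < 0
  end
end

# One manager's turn as primary: do work until the lock assertion fails,
# then step down so another manager can acquire the lock.
def serve_as_primary(name, lock, events)
  events << "#{name} became primary"
  loop do
    lock.assert! # real code would verify the ZK lock before each failover decision
  end
rescue LockAssertionFailedError
  events << "#{name} lost the lock"
end

events = []
%w[app-2 app-5].each { |name| serve_as_primary(name, FlakyLock.new(3), events) }
```

If the underlying ZK session never fully recovers, every manager repeats this acquire/assert/fail cycle in turn, which would produce exactly the primary-swapping behavior reported above.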

ryanlecompte commented 12 years ago

Hmm, so one member of your ZooKeeper cluster was taken offline when you took the app server down? Did you see other redis_node_managers swapping / becoming the primary, or just the one on that box? The logs seem to indicate that there was a ZK connection error which caused that instance to lose its ZK lock. I'm wondering if the underlying ZK client had trouble reconnecting. Did you by any chance try restarting that redis_node_manager?

Ryan


maxjustus commented 12 years ago

The node managers all started shuffling around, becoming primary. So app-2 will be primary for 20 seconds and then raise that exception, then app-5 will become primary and do the same thing, and so on. I didn't try restarting the managers; I can give that a shot.

ryanlecompte commented 12 years ago

That's really strange. I'm wondering if it had something to do with the way the exclusive locker is implemented in the ZK gem and the fact that an entire ZooKeeper node was moved to a different host. @slyphon, do you have any ideas here?

Max, can you ping me on gchat also? I'm lecompte@gmail.com


ryanlecompte commented 12 years ago

Also, how many nodes do you have in your ZooKeeper cluster? When you took down a node and moved it to a different host, did you update your ZK nodes config for redis_failover?


maxjustus commented 12 years ago

We've got 7 zookeeper nodes (running on the same servers as the node managers). I didn't update the config since I didn't expect the server to be down for long and assumed it'd be somewhat resilient to a node going down now and again.
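For context, standard ZooKeeper quorum arithmetic (general ZK behavior, not anything specific to redis_failover or this thread) says the resilience assumption here is reasonable: an ensemble of n servers stays available as long as a strict majority can still reach each other.

```ruby
# Quorum size for a ZooKeeper ensemble: a strict majority of n servers.
def quorum_size(ensemble_size)
  ensemble_size / 2 + 1
end

# How many servers can fail before the ensemble loses quorum.
def tolerated_failures(ensemble_size)
  ensemble_size - quorum_size(ensemble_size)
end
```

A 7-node ensemble needs 4 servers for quorum and tolerates 3 failures, so taking one ZK node offline, as happened here, leaves the ensemble itself healthy; the flapping would have to come from the client side (the lost lock), not from a loss of quorum.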

maxjustus commented 12 years ago

And yeah, one of the zookeeper nodes was taken offline, along with one of the redis_node_manager instances.

ryanlecompte commented 12 years ago

Gotcha. So, that would leave 6 zookeeper nodes running. When you brought the zookeeper node back online, did the hostname change at all?


maxjustus commented 12 years ago

I dunno, it hasn't come back up yet :)

maxjustus commented 12 years ago

Ok, so it's back up and the zk instance correctly rejoined the cluster. I'm still seeing the exception in zk and in the redis node manager gisted above. It looks like it's swapping between masters a lot faster than I thought: https://gist.github.com/afce95a96838a4becd94 I'll see if bringing it down to one manager and then bringing each one back up one by one fixes it. I had this same issue on our staging servers and that seemed to do the trick.

ryanlecompte commented 12 years ago

Great, yes please give that a shot. Also, are you using redis_failover with ZK 1.7? That hasn't been tested and may be a cause of some of your issues, since I only test redis_failover with 1.6.x (which is what the gemspec specifies).

Keep me posted!
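Pinning the dependency to the tested series might look like this in an app's Gemfile (a sketch only; the exact constraint in redis_failover's gemspec may differ, so check it before copying):

```ruby
# Gemfile fragment: keep the zk gem on the 1.6.x series that redis_failover
# is tested against. The "~> 1.6.0" constraint here is illustrative.
gem 'zk', '~> 1.6.0'
gem 'redis_failover'
```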


ryanlecompte commented 12 years ago

I'm going to close this issue as fixed in redis_failover 1.0 (to be released tomorrow AM). It relies on a new ZK version that has better locking cleanup. I was not able to repro this behavior with the latest ZK version.