Closed: maxjustus closed this issue 12 years ago
Hmm, so one member of your ZooKeeper cluster was taken offline when you took the app server down? Did you see the other redis_node_managers swapping / becoming the primary, or just on that one box? The logs seem to indicate that there was a ZK connection error which caused that instance to lose its ZK lock. I'm wondering if the underlying ZK client had trouble reconnecting. Did you by any chance try restarting that redis_node_manager?
Ryan
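For anyone following along, here's a minimal sketch of the exclusive-lock-plus-assertion pattern the node manager relies on. This is not redis_failover's actual code: it assumes the zk ~> 1.6 API, and the connection string and lock name are made up for illustration.

require 'zk'

zk = ZK.new('zk1:2181,zk2:2181,zk3:2181')      # illustrative ensemble
locker = zk.locker('master_node_manager_lock') # illustrative lock name

if locker.lock  # non-blocking attempt; true only if we obtained the lock
  begin
    loop do
      # If the ZK session hiccups and the underlying lock node is lost,
      # assert! raises ZK::Exceptions::LockAssertionFailedError -- the same
      # error that shows up in the gist above.
      locker.assert!
      # ... do the primary node manager's work here ...
      sleep 5
    end
  ensure
    locker.unlock
  end
end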
The node managers all started shuffling around, becoming primary. So app-2 will be primary for 20 seconds and then raise that exception, then app-5 will become primary and do the same thing, and so on. I didn't try restarting the managers; I can give that a shot.
That's really strange. I'm wondering if it had something to do with the way the exclusive locker is implemented in the ZK gem and the fact that an entire ZooKeeper node was moved to a different host. @slyphon, do you have any ideas here?
Max, can you ping me on gchat also? I'm lecompte@gmail.com
Also, how many nodes do you have in your ZooKeeper cluster? When you took down a node and moved it to a different host, did you update your ZK nodes config for redis_failover?
We've got 7 zookeeper nodes (running on the same servers as the node managers). I didn't update the config since I didn't expect the server to be down for long and assumed it'd be somewhat resilient to a node going down now and again.
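As an aside, this is standard ZooKeeper quorum behavior rather than anything redis_failover-specific, but here is a quick sketch of the arithmetic behind that assumption, using the ensemble size from this thread:

# ZooKeeper stays available as long as a majority of the ensemble is up.
ensemble_size = 7
quorum        = ensemble_size / 2 + 1   # => 4 nodes required
tolerable     = ensemble_size - quorum  # => up to 3 nodes can be down

So losing a single ZooKeeper node on its own shouldn't take the ensemble down.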
And yeah, one of the zookeeper nodes was taken offline, along with one of the redis_node_manager instances.
Gotcha. So, that would leave 6 zookeeper nodes running. When you brought the zookeeper node back online, did the hostname change at all?
I dunno, it hasn't come back up yet :)
Ok, so it's back up and the zk instance correctly rejoined the cluster. I'm still seeing the exceptions in zk and in the redis node manager that I gisted above. It looks like it's swapping between masters a lot faster than I thought: https://gist.github.com/afce95a96838a4becd94 I'll see if bringing it down to one manager and then bringing each one back up one by one fixes it. I had this same issue on our staging servers and that seemed to do the trick.
Great, yes please give that a shot. Also, are you using redis_failover with ZK 1.7? That hasn't been tested and may be a cause of some of your issues, since I only test redis_failover with 1.6.x (which is what the gemspec specifies).
Keep me posted!
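For reference, an illustrative Gemfile pin for staying on the tested series; the version constraints here are assumptions, so check the redis_failover gemspec for the authoritative requirement:

source 'https://rubygems.org'

gem 'redis_failover'
gem 'zk', '~> 1.6.0'  # stay on the 1.6.x series that redis_failover is tested against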
I'm going to close this issue as fixed in redis_failover 1.0 (to be released tomorrow AM). It relies on a new ZK version that has better locking cleanup. I was not able to repro this behavior with the latest ZK version.
We're running 7 instances of redis node manager in production, one per app server. There was a connection timeout with zookeeper on a few clients:
ZK::Exceptions::OperationTimeOut: inputs: {:path=>"/redis_failover_nodes"}
This happened because we took down one of our app servers to move it to a different host. The loss of connection seemed to cause the node managers to start switching the master role about once every 20 seconds with the error:
ZK::Exceptions::LockAssertionFailedError: we do not actually hold the lock
https://gist.github.com/3feb567ff0374be12757
Any ideas?