python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0

Kazoo heartbeats don't work with eventlet or monkey-patched threads #364

Open rjaiwal5139 opened 8 years ago

rjaiwal5139 commented 8 years ago

More details here: https://bugs.launchpad.net/python-tooz/+bug/1512001

bbangert commented 8 years ago

The bug cited includes this tidbit from the server:

 2015-11-03 18:37:37,380 - WARN [SyncThread:0:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:0 took 3633ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide

Is the heartbeat not working, or is the server experiencing issues that affect latency, so that it does not respond to the ping quickly enough?

Can someone verify that kazoo does not send heartbeat pings with eventlet independent of the rest of the bug cited here where the ZK server appears to be slammed or malfunctioning?

bbangert commented 8 years ago

BTW, the error that follows the warning I pasted above should be especially concerning for the operation of the Zookeeper cluster:

 2015-11-03 18:37:37,392 - ERROR [CommitProcessor:0:NIOServerCnxn@180] - Unexpected Exception:
java.nio.channels.CancelledKeyException
        at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyImpl.java:73)
        at sun.nio.ch.SelectionKeyImpl.interestOps(SelectionKeyImpl.java:77)
        at org.apache.zookeeper.server.NIOServerCnxn.sendBuffer(NIOServerCnxn.java:153)
        at org.apache.zookeeper.server.NIOServerCnxn.sendResponse(NIOServerCnxn.java:1076)
        at org.apache.zookeeper.server.FinalRequestProcessor.processRequest(FinalRequestProcessor.java:404)
        at org.apache.zookeeper.server.quorum.CommitProcessor.run(CommitProcessor.java:74)

Zookeeper is running into errors while attempting to commit its sync log. Since Zookeeper is spinning its own event loop (NIO), if it's blocked waiting for quorum and trying to write the sync log, clients will not receive pings.

rjaiwal5139 commented 8 years ago

After patching the kazoo connection with https://github.com/python-zk/kazoo/pull/363, I do see regular pings over time, but I get a lot of expired session messages like these:

 (kazoo.client): 2015-11-13 14:33:24,379 WARNING connection _connect_attempt Session has expired
 (kazoo.client): 2015-11-13 14:33:24,379 INFO client _session_callback Zookeeper session lost, state: EXPIRED_SESSION
 (kazoo.client): 2015-11-13 14:57:52,985 INFO connection _connect Connecting to padawan-ccp-c1-m1-mgmt:2181
 (kazoo.client): 2015-11-13 14:57:52,986 DEBUG connection _submit Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=0, time_out=10000, session_id=0, passwd='\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00', read_only=None)
 (kazoo.client): 2015-11-13 14:57:52,989 INFO client _session_callback Zookeeper connection established, state: CONNECTED
 (kazoo.client): 2015-11-13 14:57:52,999 DEBUG connection _submit Sending request(xid=1): GetChildren(path=u'/tooz/ceilometer.notification', watcher=<bound method ChildrenWatch._watcher of <kazoo.recipe.watchers.ChildrenWatch object at 0x7fd3e7091cd0>>)
 (kazoo.client): 2015-11-13 14:57:53,011 DEBUG connection _read_response Received response(xid=1): []

Querying Zookeeper from the shell also returns an empty list:

 [zk: localhost:2181(CONNECTED) 9] ls /tooz/ceilometer.notification
 []

bbangert commented 8 years ago

When you say you see regular pings, does that mean you see kazoo sending them more frequently, or do you see responses?

If the underlying problem is that Zookeeper still cannot sync properly then sending more pings will not keep the session active since Zookeeper won't process them within the session lifetime.
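To make that point concrete, here is a minimal sketch of the timing argument. It is a simplified model, not ZooKeeper's actual expiry logic: the function name `session_expired` and the timestamps are hypothetical, and the only value taken from this thread is the 10000 ms session timeout shown in the `Connect` request in the logs above. The model assumes a session stays alive only if the server processes some request from the client at least once per timeout window.

```python
def session_expired(processed_at_ms, session_timeout_ms, now_ms, started_at_ms=0):
    """Simplified model: a session expires when the server goes longer than
    session_timeout_ms without processing any request from the client.

    processed_at_ms holds the times (in ms) at which the server actually
    processed a ping, not the times at which the client sent one.
    """
    last_seen = started_at_ms
    for processed_at in sorted(processed_at_ms):
        if processed_at > now_ms:
            break
        if processed_at - last_seen > session_timeout_ms:
            # The server was silent for too long before this ping got through.
            return True
        last_seen = processed_at
    return now_ms - last_seen > session_timeout_ms


# Hypothetical timelines with a 10000 ms session timeout.
healthy = [3000, 6000, 9000, 12000, 15000]  # every ping is processed promptly
stalled = [3000, 6000]                      # server stops processing after 6 s
late = [3000, 6000, 16500]                  # third ping is processed too late

print(session_expired(healthy, 10000, now_ms=17000))  # False
print(session_expired(stalled, 10000, now_ms=17000))  # True
print(session_expired(late, 10000, now_ms=17000))     # True
```

In the `stalled` and `late` cases the client may have sent pings on schedule; what matters in this model is when the server got around to processing them.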

rjaiwal5139 commented 8 years ago

I think kazoo is sending them as defined (at regular intervals), but the response is empty. When the agents are restarted, the response is there: I get 3 UUIDs for the 3 agents, and the Zookeeper shell returns the same. Soon after that, though, the agent logs show session expiry, and an empty response is returned on all 3 agents and also by the Zookeeper shell for ls /tooz/ceilometer.notification, as shown above. Things stop working when the session expires.

harlowja commented 8 years ago

So we chatted on IRC. rjaiwal5139 is going to do some testing by putting Zookeeper on its own hardware (separated from other VMs) and report back on whether this issue still occurs once that change is made.

Useful link for folks:

https://zookeeper.apache.org/doc/r3.1.2/zookeeperAdmin.html#Single+Machine+Requirements

rjaiwal5139 commented 8 years ago

Slight correction to my earlier response: I was testing with 3 VM instances all sharing the same bare-metal host, with ZK running on all 3, not just one. However, I was passing just one hostname to Kazoo.

Initial observation on a multi-node test install shows Zookeeper without any of the sync errors on my local setup. Does Kazoo handle failover among ZK hosts when more than one host is specified?

bbangert commented 8 years ago

If kazoo's pings to a server time out, it will move to the next server in the list, yes. I believe it will also separate out the hosts if a single DNS name resolves to multiple ones, and rotate amongst those as well.
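As a rough illustration of that failover behaviour, the sketch below cycles through a comma-separated host list until one connection attempt succeeds. It is not kazoo's implementation: `parse_hosts`, `connect_with_failover`, and the `try_connect` callback are hypothetical names, the host names are made up, and the sketch tries hosts in the order given and does not handle IPv6 literals or DNS expansion.

```python
import itertools


def parse_hosts(hosts):
    """Split 'zk1:2181,zk2:2181' into [(host, port), ...].

    A missing port falls back to 2181, the usual ZooKeeper client port.
    """
    parsed = []
    for entry in hosts.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, _, port = entry.partition(":")
        parsed.append((host, int(port) if port else 2181))
    return parsed


def connect_with_failover(hosts, try_connect, max_attempts):
    """Try each host in turn, wrapping around, until try_connect succeeds.

    try_connect(host, port) is a caller-supplied function returning True
    on success. Returns the (host, port) that worked, or None.
    """
    parsed = parse_hosts(hosts)
    if not parsed:
        return None
    candidates = itertools.cycle(parsed)
    for _ in range(max_attempts):
        host, port = next(candidates)
        if try_connect(host, port):
            return host, port
    return None


# Hypothetical usage: pretend zk1 is unreachable.
unreachable = {"zk1"}
chosen = connect_with_failover(
    "zk1:2181,zk2:2181,zk3:2181",
    lambda host, port: host not in unreachable,
    max_attempts=6,
)
print(chosen)  # ('zk2', 2181)
```

With kazoo itself, the equivalent input is the comma-separated string passed as the `hosts` argument of `KazooClient`, so listing all three ZK servers there, rather than a single hostname, is what gives the client somewhere to fail over to.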