Open colinmcintosh opened 4 years ago
Is this similer to what we see in #17?
Hi there,
I'm able to reproduce this with https://github.com/samuel/go-zookeeper: I'm getting the same authentication failed: EOF
error if all zookeeper nodes are unavailable at the same time and get replaced. If the issue is only due to the network, the connection recovers ok but if the zookeeper nodes get replaced (a restart is enough), the app needs to be restarted since it will try to reconnect with the stale session and fail to do so. I would say that this is the same behaviour as described in https://github.com/go-zookeeper/zk/issues/17.
I tried to investigate this issue more but the problem here is that I don't know how to differentiate between an io.EOF
error because there is no quorum and one because the whole cluster has been reset.
I tried to implement a fix based on https://github.com/go-zookeeper/zk/pull/37, adding an additional session reset when getting a io.EOF
error but this is a too broad clause and will cause the session to be reset even when there is no need to (e.g. if there is no quorum).
Interesting. It's odd that the zookeeper server just end the connection if the session could not be found vs a more specific error (or perhaps we aren't parsing the error?)
Has anyone dug through the java client code to see if there is some special handling there?
I created a patch to workaround this. It's based on a different fork, but it should be compatible. Reset_session_state_when_entire_quorum_losses_our_session.patch.txt
TL;DR; The client keeps track of how long ago was the last successful authentication, and if we see an auth request fail with io.EOF
AFTER the session timeout elapses, we have to assume the ZK quorum lost our session. To recover, just reset the clients session state and invalidate watches. This seems reasonable to me, and it appears to work as expected when I restart a ZK in a single-node case.
Shall I open a PR?
This is the commit on the fork https://github.com/jpfourny/zk/commit/cd8998aad6c22c257ca8c4635e21e45ee88040df I think this is a reasonable patch and better than simply getting stalled.
@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches
. If you think they are sufficiently similar enough I can update the existing PR with that call as well.
https://github.com/go-zookeeper/zk/pull/37/commits/0607feec476bd9dc734045c7ce82d0c13c3c3222
@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to
invalidateWatches
. If you think they are sufficiently similar enough I can update the existing PR with that call as well.
That works. I do recommend you move c.invalidateWatches(err)
into c.resetSession
. I imagine watches would be no good if the session is reset, anyway.
I vote for getting this merged. If we lost quorum for longer than session timeout, then our session would have expired anyway. I feel it's safe for the client to abandon their session at this point, no matter what the state of the quorum actually is.
I updated PR #37 to include the call to c.invalidateWatches
in c.resetSession
as @jpfourny suggested.
Hi, Is there still ongoing work on this issue
How can I recreate this scenario with a Zookeeper cluster (single- or 3-node) and a client that uses go-zookeeper? Any suggestions @jpfourny @colinmcintosh ?
Hi, I noticed an issue where the go-zookeeper client will not reset the session if the Zookeeper cluster loses it's session state (instance reset, redeploy, etc) and the server responds with an EOF. When this happens the client enters an endless loop trying to authenticate with the previous Session ID. It looks like the loop happens here: https://github.com/go-zookeeper/zk/blob/50daf81d01a3f5d6209441a9e4d7f6e7d32fab64/conn.go#L435
I haven't had a chance to look through the code yet for this but I'm wondering if anyone knows what process needs to happen to reset the session in the client?