go-zookeeper / zk

Native ZooKeeper client for Go
BSD 3-Clause "New" or "Revised" License
506 stars 130 forks source link

Zookeeper client session is not reset when cluster loses session state (authentication failed: EOF) #36

Open colinmcintosh opened 4 years ago

colinmcintosh commented 4 years ago

Hi, I noticed an issue where the go-zookeeper client will not reset the session if the Zookeeper cluster loses it's session state (instance reset, redeploy, etc) and the server responds with an EOF. When this happens the client enters an endless loop trying to authenticate with the previous Session ID. It looks like the loop happens here: https://github.com/go-zookeeper/zk/blob/50daf81d01a3f5d6209441a9e4d7f6e7d32fab64/conn.go#L435

I haven't had a chance to look through the code yet for this but I'm wondering if anyone knows what process needs to happen to reset the session in the client?

nemith commented 3 years ago

Is this similer to what we see in #17?

mathispesch commented 3 years ago

Hi there, I'm able to reproduce this with https://github.com/samuel/go-zookeeper: I'm getting the same authentication failed: EOF error if all zookeeper nodes are unavailable at the same time and get replaced. If the issue is only due to the network, the connection recovers ok but if the zookeeper nodes get replaced (a restart is enough), the app needs to be restarted since it will try to reconnect with the stale session and fail to do so. I would say that this is the same behaviour as described in https://github.com/go-zookeeper/zk/issues/17.

mathispesch commented 3 years ago

I tried to investigate this issue more but the problem here is that I don't know how to differentiate between an io.EOF error because there is no quorum and one because the whole cluster has been reset. I tried to implement a fix based on https://github.com/go-zookeeper/zk/pull/37, adding an additional session reset when getting a io.EOF error but this is a too broad clause and will cause the session to be reset even when there is no need to (e.g. if there is no quorum).

nemith commented 3 years ago

Interesting. It's odd that the zookeeper server just end the connection if the session could not be found vs a more specific error (or perhaps we aren't parsing the error?)

Has anyone dug through the java client code to see if there is some special handling there?

jpfourny commented 2 years ago

I created a patch to workaround this. It's based on a different fork, but it should be compatible. Reset_session_state_when_entire_quorum_losses_our_session.patch.txt

TL;DR; The client keeps track of how long ago was the last successful authentication, and if we see an auth request fail with io.EOF AFTER the session timeout elapses, we have to assume the ZK quorum lost our session. To recover, just reset the clients session state and invalidate watches. This seems reasonable to me, and it appears to work as expected when I restart a ZK in a single-node case.

Shall I open a PR?

inf-rno commented 2 years ago

This is the commit on the fork https://github.com/jpfourny/zk/commit/cd8998aad6c22c257ca8c4635e21e45ee88040df I think this is a reasonable patch and better than simply getting stalled.

colinmcintosh commented 2 years ago

@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches. If you think they are sufficiently similar enough I can update the existing PR with that call as well.

https://github.com/go-zookeeper/zk/pull/37/commits/0607feec476bd9dc734045c7ce82d0c13c3c3222

jpfourny commented 2 years ago

@jpfourny Your patch looks pretty similar to the PR I submitted above with this issue originally. Looking through it I think the only major difference between them is your call to invalidateWatches. If you think they are sufficiently similar enough I can update the existing PR with that call as well.

0607fee

That works. I do recommend you move c.invalidateWatches(err) into c.resetSession. I imagine watches would be no good if the session is reset, anyway.

I vote for getting this merged. If we lost quorum for longer than session timeout, then our session would have expired anyway. I feel it's safe for the client to abandon their session at this point, no matter what the state of the quorum actually is.

colinmcintosh commented 2 years ago

I updated PR #37 to include the call to c.invalidateWatches in c.resetSession as @jpfourny suggested.

LittleCuteBug commented 1 year ago

Hi, Is there still ongoing work on this issue

amukherj commented 5 months ago

How can I recreate this scenario with a Zookeeper cluster (single- or 3-node) and a client that uses go-zookeeper? Any suggestions @jpfourny @colinmcintosh ?