samuel / go-zookeeper

Native ZooKeeper client for Go. This project is no longer maintained. Please use https://github.com/go-zookeeper/zk instead.
BSD 3-Clause "New" or "Revised" License

Failed ZK connection retry strategy #88

Open youngkin opened 8 years ago

youngkin commented 8 years ago

I'm testing connection failure scenarios between ZK (3.4.6) and a client written with go-zookeeper. When I kill the ZK server go-zookeeper retries the connection to ZK, but the retry loop seems to have no limit to the number of retries.

2015/10/02 11:42:16 Failed to connect to ...: dial tcp ...: connection refused

Is there any way to influence how many times or how long go-zookeeper will continue to retry connecting to ZK? What will go-zookeeper return to the client if it exceeds the retry limit?

Thanks!

samuel commented 8 years ago

ZooKeeper is expected to always be alive so the client will retry forever. However, requests should return with ErrNoServer when the client has attempted to connect to all servers but failed.

Logic for that is here: https://github.com/samuel/go-zookeeper/blob/master/zk/conn.go#L231

It's possible for a user of the client to implement an overall timeout by watching the event stream and calling Close().
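
For example, something roughly like this (an untested sketch; the 30-second limit, one-second ticker, and server address are just placeholders):

```go
package main

import (
	"log"
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// closeAfterTimeout watches the session event stream and calls Close() on the
// connection if it spends longer than limit without a live session (this also
// covers the case where the initial connect never succeeds).
func closeAfterTimeout(conn *zk.Conn, events <-chan zk.Event, limit time.Duration) {
	hasSession := false
	lostAt := time.Now() // we start out without a session
	tick := time.NewTicker(time.Second)
	defer tick.Stop()

	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return // event channel closed, connection already gone
			}
			switch ev.State {
			case zk.StateHasSession:
				hasSession = true
			case zk.StateDisconnected, zk.StateExpired:
				if hasSession {
					lostAt = time.Now()
				}
				hasSession = false
			}
		case <-tick.C:
			if !hasSession && time.Since(lostAt) > limit {
				log.Printf("no ZK session for %v, giving up", limit)
				conn.Close() // pending and future requests will now fail
				return
			}
		}
	}
}

func main() {
	conn, events, err := zk.Connect([]string{"192.168.12.11:2181"}, 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	go closeAfterTimeout(conn, events, 30*time.Second)
	// ... use conn as usual ...
}
```

Once Close() has been called, outstanding and subsequent requests fail, which the caller can treat as the overall timeout having fired.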

youngkin commented 8 years ago

Thanks for the quick response!

Regarding:

ZooKeeper is expected to always be alive so the client will retry forever. However, requests should return with ErrNoServer when the client has attempted to connect to all servers but failed.

I do observe the behavior described in the first sentence if there's a failure on a successfully opened connection. I see the behavior described in the second sentence if the failure occurs on the initial connection attempt.

Regarding:

It's possible for a user of the client to implement an overall timeout by watching the event stream and calling Close().

I'm not quite sure what you mean here by "watching the event stream". The client is monitoring the connection event channel. However, including the original disconnect event, the only other event received is the connecting event:

Received channel event: {EventSession StateDisconnected <nil> 192.168.12.11:2181}
Received channel event: {EventSession StateConnecting <nil> 192.168.12.11:2181}

After that the library goes into the endless reconnect loop. There are no more events sent on the channel returned by zk.Connect().

There could be logic to put all the retry responsibility in the hands of the client - i.e., via the original disconnected event. Then the client could disconnect/connect until it's successful or until it reaches its retry limit. Is this what you're suggesting?

I'm new to both zookeeper and Go, so it's possible I'm missing something.
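
To make sure I'm reading you right, this is roughly what I had in mind (probably not idiomatic Go; the retry count, timeouts, and helper names are all made up):

```go
package zkretry // hypothetical, just to illustrate the question above

import (
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// waitForSession drains session events until a session is established or the
// per-attempt deadline expires.
func waitForSession(events <-chan zk.Event, timeout time.Duration) bool {
	deadline := time.After(timeout)
	for {
		select {
		case ev, ok := <-events:
			if !ok {
				return false // event channel closed
			}
			if ev.State == zk.StateHasSession {
				return true
			}
		case <-deadline:
			return false
		}
	}
}

// connectWithLimit is what the client would call after seeing the original
// disconnected event: it dials, waits briefly for a session, and gives up
// after maxRetries attempts instead of retrying forever.
func connectWithLimit(servers []string, maxRetries int) (*zk.Conn, <-chan zk.Event, error) {
	for attempt := 1; attempt <= maxRetries; attempt++ {
		conn, events, err := zk.Connect(servers, 5*time.Second)
		if err != nil {
			return nil, nil, err
		}
		if waitForSession(events, 10*time.Second) {
			return conn, events, nil
		}
		conn.Close()
	}
	return nil, nil, zk.ErrNoServer // give up; all attempts failed
}
```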

samuel commented 8 years ago

I guess I meant more that it's possible to create a timeout by watching for when the state is StateConnecting, rather than counting retries. It's possible to detect retries at the per-request level (e.g. Get, Children), since you'll get an ErrNoServer error every time all servers have been attempted, but at the global level just watching the event stream doesn't give the number of attempts.
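
So a bounded retry has to live in the caller, e.g. something along these lines (a rough sketch; the package name, retry cap, and backoff are made up):

```go
package zkutil // hypothetical helper package, just for illustration

import (
	"time"

	"github.com/samuel/go-zookeeper/zk"
)

// getWithRetry wraps conn.Get with a cap on how many full sweeps over the
// configured servers the caller is willing to tolerate: each zk.ErrNoServer
// return means every server has already been attempted once.
func getWithRetry(conn *zk.Conn, path string, maxSweeps int) ([]byte, *zk.Stat, error) {
	var (
		data []byte
		stat *zk.Stat
		err  error
	)
	for i := 0; i < maxSweeps; i++ {
		data, stat, err = conn.Get(path)
		if err != zk.ErrNoServer {
			return data, stat, err // success, or an error unrelated to connectivity
		}
		time.Sleep(time.Second) // placeholder backoff between sweeps
	}
	return nil, nil, err // still zk.ErrNoServer after maxSweeps sweeps
}
```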

Is there a specific use case where you'll need to stop the ZK server, or would you want the client to stop trying to connect if it's on the minority side of a partition? I would prefer not to implement a max retry count unless there's a pressing need, as it seems atypical of ZK usage to want the client to stop trying to connect.

youngkin commented 8 years ago

Hi Samuel,

I'd like to stop trying to connect if the client has lost its connection to ZK, or is on the minority side of a partition, for longer than a certain timeout value.

I'm implementing a leadership election process for a reliable work queue. A job must either complete successfully or be picked up by another worker. Jobs are idempotent so it's not a problem to partially or fully re-execute them, I just don't want multiple processes working on the same job simultaneously. If the leader/owner gets partitioned/disconnected it must stop all processing on the job, preferably before the associated ephemeral node is removed by ZK. If the leader/owner fails then all this comes for free (more or less). Leader election will then pick a replacement.

So I need a way to detect a relatively permanent partition of the leader. This is what I was hoping to get out of the box with go-zookeeper. I can implement the timeout at the client level, I was just not expecting go-zookeeper to keep trying indefinitely.

Thanks, Rich

youngkin commented 8 years ago

I'm curious what your thoughts are on this. If needed I'll implement the retry strategy I need in the client.

Thanks, Rich

yazgoo commented 6 years ago

Hi, I also have this issue. I think a ZooKeeper client should not loop on trying to connect in all cases, since we have a one-shot client that needs to fail if ZK is not available.