python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0
1.3k stars 386 forks source link

self._attempts of KazooRetry will nerver grow up more than 1 when ConnectionDropped exception occur repeatedly in _connect_attempt #687

Closed thenumberouscode closed 6 months ago

thenumberouscode commented 1 year ago

Expected Behavior

when I set max_tries = 3 , I found that the retry still can be infinity. Does I misunderstand the meaning of max_tries? I think self._attempts should be grow up when retry actually occur.

Actual Behavior

now self._attempts is always change between 0 and 1 when ConnectionDropped exception occur repeatedly in _connect_attempt.

Snippet to Reproduce the Problem

COPY/PASTE the snippet here (omit any sensitive information)

Logs with logging in DEBUG mode

COPY/PASTE the result of the snippet here (omit any sensitive information)

Specifications

thenumberouscode commented 1 year ago

I handwrite a ConnectionDroped exception in connection.py line 587

Snipaste_2022-12-23_13-31-39
thenumberouscode commented 1 year ago

then I set max_tries=3, the log indicated that client retried many times. KazooClient(self.hosts, auth_data=[("digest", self.auth)], connection_retry={"max_tries":"3", "delay": 0.1, "backoff": 2, "max_delay": 3600}, logger=logger)

Snipaste_2022-12-23_20-34-24
a-ungurianu commented 1 year ago

Have you noticed this happening in the wild?

If the server was in a bad state, self._connect would fail which is only a few lines above.

Looking at the code, this would only cause this pathological case only if the server would always respond successfully to a connect but fail on a ping.

I guess the assumption would be that after a connection drop in the ping loop, we want to immediately reconnect after which either the problem got solved or the _connect call would throw which happens before the retry.reset() call that results in the status you notice.

IMO I think this assumption holds for the lifetime of a client, especially considering that failure in a _connect_attempt would result in the next connection being attempted on the next host in the list, which hopefully has a high chance of succeeding.

thenumberouscode commented 1 year ago

Yes, I think your understanding of the code logic is reasonable, the issue can be closed.