fix(core): Adjust connection timeout and retry logic

ceache commented 1 year ago

Why is this needed?

Do not use the session timeout as connection timeout. This value is too large and results in a bad non-responsive server holding up the client long enough for the session to timeout.

Proposed Changes

Use the KazooRetry object to use an increasing backoff timeout and cycle over all servers quickly, working around bad servers with minimal impact.

Does this PR introduce any breaking change?

This change should be transparent at a high level. It makes the client connection attempts more aggressive, on purpose.

ceache commented 1 year ago

@python-zk/maintainers This is an early draft of fix to the connection logic. The original issue that sent me on this fix is as follow:

zkc = KazooClient(hosts=["nonresponsive:2181", "goodhost:2181"], timeout=60)
zkc.start()

On a new connection, it results in a painful delay of 60 seconds wasted on the bad server before trying the good one. On a reconnect, it is worse. It results in the connection being stuck on an attempt on the bad server with a TCP timeout set to 60 seconds, i.e. equal to the session timeout. By the time it gives up, and moves on to trying the good host, the session is already lost.

codecov-commenter commented 1 year ago

Codecov Report

Patch coverage: 100.00% and project coverage change: -2.26 :warning:

Comparison is base (2c36d69) 96.65% compared to head (fd57818) 94.40%.

:exclamation: Current head fd57818 differs from pull request most recent head e34e407. Consider uploading reports for the commit e34e407 to get more accurate results

Additional details and impacted files

```diff @@ Coverage Diff @@ ## master #685 +/- ## ========================================== - Coverage 96.65% 94.40% -2.26% ========================================== Files 27 58 +31 Lines 3554 8413 +4859 ========================================== + Hits 3435 7942 +4507 - Misses 119 471 +352 ``` | [Impacted Files](https://codecov.io/gh/python-zk/kazoo/pull/685?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk) | Coverage Δ | | |---|---|---| | [kazoo/protocol/connection.py](https://codecov.io/gh/python-zk/kazoo/pull/685?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk#diff-a2F6b28vcHJvdG9jb2wvY29ubmVjdGlvbi5weQ==) | `96.28% <100.00%> (-1.44%)` | :arrow_down: | | [kazoo/retry.py](https://codecov.io/gh/python-zk/kazoo/pull/685?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk#diff-a2F6b28vcmV0cnkucHk=) | `100.00% <100.00%> (ø)` | | | [kazoo/tests/test\_connection.py](https://codecov.io/gh/python-zk/kazoo/pull/685?src=pr&el=tree&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk#diff-a2F6b28vdGVzdHMvdGVzdF9jb25uZWN0aW9uLnB5) | `100.00% <100.00%> (ø)` | | ... and [41 files with indirect coverage changes](https://codecov.io/gh/python-zk/kazoo/pull/685/indirect-changes?src=pr&el=tree-more&utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk) Help us with your feedback. Take ten seconds to tell us [how you rate us](https://about.codecov.io/nps?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk). Have a feature suggestion? [Share it here.](https://app.codecov.io/gh/feedback/?utm_medium=referral&utm_source=github&utm_content=comment&utm_campaign=pr+comments&utm_term=python-zk)

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.

ceache commented 1 year ago

I am pondering if the below should be changed too.

I really do not think session_timeout_in_sec / number_of_hosts is appropriate here. I would like it to use the client's connection_retry or even command_retry.

        connect_result, zxid = self._invoke(
            client._session_timeout / 1000.0 / len(client.hosts), connect
        )

StephenSorriaux commented 1 year ago

Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low from what the previous was (100ms vs 10s). I like the changes making the client less lazy but I am not sure if that kind of default value would suit to every current possible uses of the client and, if possible, I would prefer we try not to change it. I noticed that in ConnectionHandler, the retry_sleeper is only used for the zk_loop, do you think some logic around the initial connection_retry value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if the connection_retry is given (ie. the user was kind of expecting this behavior), but keep the previous value if connection_retry is None (ie. just keep things like before).

ztzg commented 3 months ago

Hi @StephenSorriaux, all,

It would be great to have this fixed in master, as teams using unpatched versions of Kazoo regularly experience "outages."

Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low from what the previous was (100ms vs 10s).

Note that 10s is (supposed to be) the session lifetime, and that by using it as a connection timeout, the session is dead by the time any unresponsive TCP/IP operation aborts. Which is… suboptimal!

I would not have dared 100ms, but agree with @ceache that the backoff should cause clients to quickly adapt to the actual network.

In any case, I vote for it over repeated puzzled users :)

I like the changes making the client less lazy but I am not sure if that kind of default value would suit to every current possible uses of the client and, if possible, I would prefer we try not to change it.

As Jeff wrote, "timeouts are always tricky"—but I don't understand why we would want to keep using this value, as it basically guarantees that the session is… dead… by the deadline. Am I missing something?

It seems that the default value should ideally be below (lifetime / n_hosts), to give the client a chance to try each host once—and even lower if we want to allow multiple "rounds."

For the record, the Java client initially sets the TCP/IP connect timeout to (requested_lifetime / n_hosts), and ajusts it to (negotiated_lifetime / n_hosts) once the negotiated lifetime is known—but it does not apply backoff.

I noticed that in ConnectionHandler, the retry_sleeper is only used for the zk_loop, do you think some logic around the initial connection_retry value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if the connection_retry is given (ie. the user was kind of expecting this behavior), but keep the previous value if connection_retry is None (ie. just keep things like before).

This would alleviate the need for a patched Kazoo, but would IMHO still be a suboptimal default.

Cheers, -D

andanicolae commented 1 month ago

Hi all. Gentle reminder :) Can someone pls take a look and do the necessary changes so that this patch gets merged? I would need a new kazoo package version that includes this patch. Thank you :)

python-zk / kazoo