Open ceache opened 1 year ago
@python-zk/maintainers This is an early draft of fix to the connection logic. The original issue that sent me on this fix is as follow:
zkc = KazooClient(hosts=["nonresponsive:2181", "goodhost:2181"], timeout=60)
zkc.start()
On a new connection, it results in a painful delay of 60 seconds wasted on the bad server before trying the good one. On a reconnect, it is worse. It results in the connection being stuck on an attempt on the bad server with a TCP timeout set to 60 seconds, i.e. equal to the session timeout. By the time it gives up, and moves on to trying the good host, the session is already lost.
Patch coverage: 100.00
% and project coverage change: -2.26
:warning:
Comparison is base (
2c36d69
) 96.65% compared to head (fd57818
) 94.40%.:exclamation: Current head fd57818 differs from pull request most recent head e34e407. Consider uploading reports for the commit e34e407 to get more accurate results
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Do you have feedback about the report comment? Let us know in this issue.
I am pondering if the below should be changed too.
I really do not think session_timeout_in_sec / number_of_hosts
is appropriate here.
I would like it to use the client's connection_retry
or even command_retry
.
connect_result, zxid = self._invoke(
client._session_timeout / 1000.0 / len(client.hosts), connect
)
Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low from what the previous was (100ms vs 10s). I like the changes making the client less lazy but I am not sure if that kind of default value would suit to every current possible uses of the client and, if possible, I would prefer we try not to change it.
I noticed that in ConnectionHandler
, the retry_sleeper
is only used for the zk_loop
, do you think some logic around the initial connection_retry
value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if the connection_retry
is given (ie. the user was kind of expecting this behavior), but keep the previous value if connection_retry
is None
(ie. just keep things like before).
Hi @StephenSorriaux, all,
It would be great to have this fixed in master, as teams using unpatched versions of Kazoo regularly experience "outages."
Just like the others, I think the changes (+ explanations) make a lot of sense. Although, just like @ztzg commented, the new default value might seem a little low from what the previous was (100ms vs 10s).
Note that 10s is (supposed to be) the session lifetime, and that by using it as a connection timeout, the session is dead by the time any unresponsive TCP/IP operation aborts. Which is… suboptimal!
I would not have dared 100ms, but agree with @ceache that the backoff should cause clients to quickly adapt to the actual network.
In any case, I vote for it over repeated puzzled users :)
I like the changes making the client less lazy but I am not sure if that kind of default value would suit to every current possible uses of the client and, if possible, I would prefer we try not to change it.
As Jeff wrote, "timeouts are always tricky"—but I don't understand why we would want to keep using this value, as it basically guarantees that the session is… dead… by the deadline. Am I missing something?
It seems that the default value should ideally be below (lifetime / n_hosts), to give the client a chance to try each host once—and even lower if we want to allow multiple "rounds."
For the record, the Java client initially sets the TCP/IP connect timeout to (requested_lifetime / n_hosts), and ajusts it to (negotiated_lifetime / n_hosts) once the negotiated lifetime is known—but it does not apply backoff.
I noticed that in
ConnectionHandler
, theretry_sleeper
is only used for thezk_loop
, do you think some logic around the initialconnection_retry
value passed to the client would be doable/make sense? My (possibly naive) idea would be to use your new behavior if theconnection_retry
is given (ie. the user was kind of expecting this behavior), but keep the previous value ifconnection_retry
isNone
(ie. just keep things like before).
This would alleviate the need for a patched Kazoo, but would IMHO still be a suboptimal default.
Cheers, -D
Hi all. Gentle reminder :) Can someone pls take a look and do the necessary changes so that this patch gets merged? I would need a new kazoo package version that includes this patch. Thank you :)
Why is this needed?
Do not use the session timeout as connection timeout. This value is too large and results in a bad non-responsive server holding up the client long enough for the session to timeout.
Proposed Changes
Does this PR introduce any breaking change?
This change should be transparent at a high level. It makes the client connection attempts more aggressive, on purpose.