python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0
1.3k stars 386 forks source link

Revert "Fix possible endless wait in stop() after AUTH_FAILED error (… #744

Closed jeblair closed 6 months ago

jeblair commented 6 months ago

Revert "Fix possible endless wait in stop() after AUTH_FAILED error (#688)"

This reverts commit 5225b3e2fab6fec3b12b789e3cc6f3218429d32d.

The commit being reverted here caused kazoo not to empty the send queue before disconnecting. This means that if a client submitted asynchronous requests and then called client.stop(), the connection would be closed immediately, usually after only one (but possibly more) of the submitted requests were sent. Prior to this, Kazoo would empty the queue of submitted requests all the way up to and including the Close request when client.stop() was called.

Another area where this caused problems is in a busy multi-threaded system. One thread might decide to gracefully close the connection, but if there is any traffic generated by another thread, then the connection would end up terminating without ever sending the Close request.

Failure to gracefully shutdown a ZooKeeper connection can mean that other system components need to wait for ephemeral node timeouts to detect that a component has shutdown.

Given that this behavior is easily reproducible and can have serious consequences in production, 5225b3e2 is reverted.

jeblair commented 6 months ago

Here is a test script that will show the behavior in both async and threaded environments:

import kazoo.client
import time
import threading
import logging

logging.basicConfig(level=5)
l = logging.getLogger('kazoo')
l.setLevel(5)

client = kazoo.client.KazooClient('localhost:2181')
client.start()

def send():
    client.create('/test', b'',
                  makepath=True, ephemeral=True, sequence=True)

def test_thread():
    t = threading.Thread(target=send)
    t.start()
    client.stop()

def test_async():
    for x in range(100):
        client.create_async('/test', b'',
                            makepath=True, ephemeral=True, sequence=True)
    client.stop()

# test_thread()
test_async()
jeblair commented 6 months ago

In addition to the reasoning in the commit message, I have also attempted to reproduce the problem described in #688 in an attempt to avoid regressing on that bug fix. I have been unable to reproduce the error using the script provided in that PR. Since the PR description mentioned threading, and the provided script did not appear to use it, I also ran a test script I devised myself along with an added sleep call in Kazoo to try to reproduce the behavior and was still unable to do so. In every case, it appears that the remote ZooKeeper server closed the connection after the AUTH_FAILED response which caused the Kazoo loop to terminate. The script and changes I made can be seen at: https://github.com/jeblair/kazoo/commit/7e22fea13fe4fd8b38fb34e5476ce4a32119a874

I tested with ZooKeeper 3.8.3; perhaps that is a behavior change from when the bug was originally reported.

StephenSorriaux commented 6 months ago

Thank you for the PR, I will look into it.

(FWIW, tests failures seem unrelated to the change)

codecov[bot] commented 6 months ago

Codecov Report

All modified and coverable lines are covered by tests :white_check_mark:

Project coverage is 96.81%. Comparing base (4c6bad8) to head (2fb93a8). Report is 4 commits behind head on master.

Additional details and impacted files ```diff @@ Coverage Diff @@ ## master #744 +/- ## ========================================== - Coverage 96.84% 96.81% -0.03% ========================================== Files 27 27 Lines 3549 3549 ========================================== - Hits 3437 3436 -1 - Misses 112 113 +1 ```

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

ceache commented 6 months ago

Thanks!