python-zk / kazoo

Kazoo is a high-level Python library that makes it easier to use Apache Zookeeper.
https://kazoo.readthedocs.io
Apache License 2.0
1.3k stars 386 forks source link

Connection timeout is not set to client._session_timeout / 1000 / len(client.hosts) in all places #756

Open andanicolae opened 1 month ago

andanicolae commented 1 month ago

Expected Behavior

Client should wait for maximum client._session_timeout / 1000 / len(client.hosts) seconds before moving to next zookeeper server.

Actual Behavior

We are using kazoo to connect each of our clients to one of the 3 zookeeper servers that form a cluster. We set timeout=120s when we instantiate kazoo.client.KazooClient class. Therefore, client._session_timeout = 120000.

Below is the code from kazoo that sets client._session_timeout: self._session_timeout = int(timeout * 1000)

Our purpose is to have negotiated session timeout=120s and connection_timeout=120s/3=40s. In some places of kazoo code it is timeout=client._session_timeout / 1000.0, while in some other places it is timeout=client._session_timeout / 1000.0 / len(client.hosts).

In all our zookeeper servers, in /conf/zoo.cfg, we have set maxSessionTimeout=120s. After negotiation, indeed, negotiated_session_timeout is 120s.

We add an iptables rule and we isolate zookeeper leader. The remaining 2 zookeeper servers close the sessions with all the clients and perform leader election. From what I have read, this is the behavior of zookeeper servers when the leader goes down.

kazoo tries to connect to next zookeeper server from the local server list (we shuffle zookeeper server list). Sometimes, kazoo will try to connect to the isolated zookeeper server. When kazoo tries to connect to any zookeeper server, this code will be called: with self._socket_error_handling(): self._socket = self.handler.create_connection( address=(hostip, port), timeout=client._session_timeout / 1000.0, use_ssl=self.client.use_ssl, keyfile=self.client.keyfile, certfile=self.client.certfile, ca=self.client.ca, keyfile_password=self.client.keyfile_password, verify_certs=self.client.verify_certs, )

As we can see, timeout is 120s. kazoo waits for 120s and then moves to another zookeeper server. Our expectation is to wait for 40s.

I have added additional logs in protocol/connection.py, which start with 'anda:' Here I have used dummy values as session ids/passwords.

Logs with logging in DEBUG mode

07/08/2024 03:12:51 AM Using session_id: XXXX session_passwd: YYYY 07/08/2024 03:12:51 AM anda: before create_connection 07/08/2024 03:14:51 AM Connection dropped: socket connection error: None 07/08/2024 03:14:51 AM anda: in _connect_loop after _connect_attempt host = hostip = 07/08/2024 03:14:51 AM anda: in _connect_loop before _connect_attempt host = hostip = 07/08/2024 03:14:51 AM Connecting to :port, use_ssl: False 07/08/2024 03:14:51 AM Using session_id: 0123 session_passwd: VVVV 07/08/2024 03:14:51 AM anda: before create_connection 07/08/2024 03:14:51 AM anda: after create_connection 07/08/2024 03:14:51 AM anda: after Connect 07/08/2024 03:14:51 AM Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=1234, time_out=120000, session_id=ZZZZ, passwd="TTTT", read_only=None) 07/08/2024 03:14:51 AM Read response Connect(protocol_version=0, last_zxid_seen=0, time_out=120000, session_id=ZZZZ, passwd="TTTT", read_only=False) 07/08/2024 03:14:51 AM anda: after _invoke 07/08/2024 03:14:51 AM Session created, session_id: ZZZZ session_passwd: AAAA negotiated session timeout: 120000 connect timeout: 40000 read timeout: 80000.0

Logs when I divide timeout value to len(client.hosts). Below is the modified code: with self._socket_error_handling(): self._socket = self.handler.create_connection( address=(hostip, port), timeout=client._session_timeout / 1000.0 / len(client.hosts), use_ssl=self.client.use_ssl, keyfile=self.client.keyfile, certfile=self.client.certfile, ca=self.client.ca, keyfile_password=self.client.keyfile_password, verify_certs=self.client.verify_certs, ) 07/09/2024 04:10:00 AM Connection dropped: socket connection broken 07/09/2024 04:10:00 AM Transition to CONNECTING 07/09/2024 04:10:00 AM Zookeeper connection lost 07/09/2024 04:10:00 AM anda: in _connect_loop after _connect_attempt host = hostip = 07/09/2024 04:10:00 AM anda: in _connect_loop before _connect_attempt host = hostip = 07/09/2024 04:10:00 AM Connecting to :port, use_ssl: False 07/09/2024 04:10:00 AM Using session_id: 1111 session_passwd: 1234 07/09/2024 04:10:00 AM anda: before create_connection 07/09/2024 04:10:40 AM Connection dropped: socket connection error: None 07/09/2024 04:10:40 AM anda: in _connect_loop after _connect_attempt host = hostip = 07/09/2024 04:10:40 AM anda: in _connect_loop before _connect_attempt host = = 07/09/2024 04:10:40 AM Connecting to :port, use_ssl: False 07/09/2024 04:10:40 AM Using session_id: 1234 session_passwd: AAAA 07/09/2024 04:10:40 AM anda: before create_connection 07/09/2024 04:10:40 AM anda: after create_connection 07/09/2024 04:10:40 AM anda: before setblocking 07/09/2024 04:10:40 AM anda: after setblocking 07/09/2024 04:10:40 AM anda: after Connect 07/09/2024 04:10:40 AM Sending request(xid=None): Connect(protocol_version=0, last_zxid_seen=3333, time_out=120000, session_id=1234, passwd='AAAA', read_only=None) 07/09/2024 04:10:40 AM Read response Connect(protocol_version=0, last_zxid_seen=0, time_out=120000, session_id=1234, passwd='BBBB', read_only=False) 07/09/2024 04:10:40 AM anda: after _invoke 07/09/2024 04:10:40 AM Session created, session_id: 1234 session_passwd: GGGG negotiated session timeout: 120000 connect timeout: 40000 read timeout: 80000.0 07/09/2024 04:10:40 AM Zookeeper connection established, state: CONNECTED

Specifications

andanicolae commented 1 month ago

Hello. Gentle reminder :) Can someone please take a look? Thank you

ceache commented 1 month ago

Hi there, thanks for reporting this.

This is, in fact, a known issue and we are already working on a fix for it. See #685 which dissociates network connection timeouts from the logical session timeout.

Cheers,

On Mon, Jul 15, 2024, 01:58 Anda Nicolae @.***> wrote:

Hello. Gentle reminder :) Can someone please take a look? Thank you

— Reply to this email directly, view it on GitHub https://github.com/python-zk/kazoo/issues/756#issuecomment-2227815203, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAIFTHVTIBLZ2EROZTHNHMTZMNXIHAVCNFSM6AAAAABKSWMDMGVHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZDEMRXHAYTKMRQGM . You are receiving this because you are subscribed to this thread.Message ID: @.***>