scylladb / python-driver

ScyllaDB Python Driver, originally DataStax Python Driver for Apache Cassandra
https://python-driver.docs.scylladb.com
Apache License 2.0

Connection doesn't propagate information about being closed to Cluster #345

Open Lorak-mmk opened 1 month ago

Lorak-mmk commented 1 month ago

Discovered when investigating https://github.com/scylladb/scylla-dtest/issues/4364

When a node goes down, it closes its client connections (probably not always: if it dies unexpectedly, it presumably has no chance to), and the connections in the driver notice this. The logs look like this:

18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180404560) 127.0.10.1:9042> closed by server
18:51:41,609 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180404560) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185696976) 127.0.10.1:9042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185696976) to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:9042
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694185158224) 127.0.10.1:19042> closed by server
18:51:41,610 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694185158224) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:373  | Connection <LibevConnection(140694180402832) 127.0.10.1:19042> closed by server
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:287  | Closing connection (140694180402832) to 127.0.10.1:19042
18:51:41,611 cassandra.io.libevreactor DEBUG libevreactor.py:291  | Closed socket to 127.0.10.1:19042

The problem is that the information about those connections closing is not propagated anywhere: the driver still thinks it has a fully functioning connection pool. If the dead node was the one the driver had its control connection open to, the driver also still thinks it has a functioning control connection and keeps waiting for events. The driver will notice that those connections are dead only when it tries to use them (sending a heartbeat, running a CQL query, refreshing the schema, etc.).

This is a problem in the following scenario (this is done in https://github.com/scylladb/scylla-dtest/issues/4364):

What the driver should do is propagate the information from a single connection upwards and then reopen connections or mark the host as down.
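To illustrate the kind of upward propagation meant here, below is a minimal, self-contained sketch. It is not the driver's code: `SketchConnection`, `SketchPool`, the `on_close` hook and the `on_host_down` callback are all hypothetical names made up for this example, and the real `Connection` / pool / `Cluster` classes may need a different mechanism.

```python
from typing import Callable, List, Optional


class SketchConnection:
    """Toy stand-in for a driver connection (hypothetical, not cassandra.connection.Connection)."""

    def __init__(self, endpoint: str,
                 on_close: Optional[Callable[["SketchConnection"], None]] = None) -> None:
        self.endpoint = endpoint
        self.is_closed = False
        self._on_close = on_close  # hypothetical hook used to notify the owner

    def close(self) -> None:
        # Idempotent: a second close() must not notify the owner again.
        if self.is_closed:
            return
        self.is_closed = True
        if self._on_close is not None:
            self._on_close(self)


class SketchPool:
    """Toy pool that learns about closed connections instead of discovering them on use."""

    def __init__(self, endpoint: str, size: int,
                 on_host_down: Callable[[str], None]) -> None:
        self.endpoint = endpoint
        self._on_host_down = on_host_down  # hypothetical callback into the cluster level
        self.connections: List[SketchConnection] = [
            SketchConnection(endpoint, on_close=self._connection_closed)
            for _ in range(size)
        ]

    def _connection_closed(self, conn: SketchConnection) -> None:
        # Drop the dead connection right away; a real pool might try to reopen it here.
        self.connections.remove(conn)
        if not self.connections:
            # No live connections left: report the host upwards.
            self._on_host_down(self.endpoint)


# Simulate the server closing both connections, as in the logs above.
down_hosts: List[str] = []
pool = SketchPool("127.0.10.1:9042", size=2, on_host_down=down_hosts.append)
for conn in list(pool.connections):
    conn.close()
```

In this sketch the pool is empty and the host is reported as down as soon as the last connection closes, without waiting for a heartbeat or a query to fail.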

mykaul commented 1 month ago

We should really use TCP keep-alive everywhere, just like GoCQL, which now uses it by default.
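For reference, enabling TCP keep-alive on a plain Python socket looks roughly like the sketch below. The helper name `enable_keepalive` and the timing values are assumptions chosen for illustration; they are not taken from this driver or from GoCQL, and the per-socket tuning options are platform-dependent, so the sketch skips any that are unavailable.

```python
import socket


def enable_keepalive(sock: socket.socket, idle_s: int = 60,
                     interval_s: int = 10, probes: int = 3) -> None:
    """Turn on TCP keep-alive; the timing values are illustrative assumptions."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Tuning knobs are not available on every platform, so apply them best-effort.
    tuning = (
        ("TCP_KEEPIDLE", idle_s),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval_s),  # seconds between probes
        ("TCP_KEEPCNT", probes),        # failed probes before the connection is dropped
    )
    for name, value in tuning:
        option = getattr(socket, name, None)
        if option is None:
            continue
        try:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
        except OSError:
            pass  # option exists in the module but is rejected by this OS


sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
enable_keepalive(sock)
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```

Keep-alive helps detect peers that vanish without closing the socket; it does not by itself tell higher layers about a connection that was already closed cleanly.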

Lorak-mmk commented 1 month ago

TCP keep-alive is not the solution here. The connection itself (and by connection I mean an instance of the Connection class) was closed gracefully, and the connection knows that it was closed. The issue is that the connection doesn't propagate this information to the Cluster object.