Session might not reconnect if all nodes are restarted at once

piodul commented 1 year ago

Observed in the following test run: https://jenkins.scylladb.com/job/scylla-master/job/next/6139/artifact/testlog/x86_64/dev/topology.test_cluster_features.2.log I'm also attaching the logs to the issue: jenkins.scylladb.com_job_scylla-master_job_next_6139_artifact_testlog_x86_64_dev_topology.test_cluster_features.2.log

Link to the source of the failing test: https://github.com/scylladb/scylladb/blob/3a73048bc9a15bebca78dc89143e4e332fb50645/test/topology/test_cluster_features.py#L150

The test test_downgrade_after_successful_upgrade_fails shuts down all nodes in the cluster, reconfigures them and then starts them again. In the linked run, the test session didn't reconnect after it happened:

14:31:46.962 DEBUG> refresh driver node list
14:31:46.963 DEBUG> [control connection] Refreshing node list and token map
14:31:46.963 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:47.003 DEBUG> refresh driver node list
14:31:47.003 DEBUG> [control connection] Refreshing node list and token map
14:31:47.003 DEBUG> [control connection] Error refreshing node list and token map
Traceback (most recent call last):
  File "cassandra/cluster.py", line 3790, in cassandra.cluster.ControlConnection.refresh_node_list_and_token_map
  File "cassandra/cluster.py", line 3816, in cassandra.cluster.ControlConnection._refresh_node_list_and_token_map
  File "cassandra/connection.py", line 1097, in cassandra.connection.Connection.wait_for_responses
cassandra.connection.ConnectionShutdown: Connection <LibevConnection(139961298484496) 127.76.173.27:9042 (closed)> is already closed
14:31:49.004 DEBUG> Error querying host 127.76.173.27:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.27:9042 is closed
14:31:49.004 DEBUG> Defunct or closed connection (139961296996112) returned to pool, potentially marking host 127.76.173.27:9042 as down
14:31:49.004 DEBUG> Shutting down connections to 127.76.173.27:9042
14:31:49.004 DEBUG> Closing connection (139961296996112) to 127.76.173.27:9042
14:31:51.005 DEBUG> Error querying host 127.76.173.21:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.21:9042 is closed
14:31:51.005 DEBUG> Defunct or closed connection (139961297243024) returned to pool, potentially marking host 127.76.173.21:9042 as down
14:31:51.005 DEBUG> Shutting down connections to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing connection (139961297243024) to 127.76.173.21:9042
14:31:51.005 DEBUG> Closing excess connection (139961297729488) to 127.76.173.21:9042
14:31:53.006 DEBUG> Error querying host 127.76.173.6:9042
Traceback (most recent call last):
  File "cassandra/cluster.py", line 4581, in cassandra.cluster.ResponseFuture._query
  File "cassandra/connection.py", line 1066, in cassandra.connection.Connection.send_msg
cassandra.connection.ConnectionShutdown: Connection to 127.76.173.6:9042 is closed
14:31:53.006 DEBUG> Defunct or closed connection (139961297252176) returned to pool, potentially marking host 127.76.173.6:9042 as down
14:31:53.006 DEBUG> Shutting down connections to 127.76.173.6:9042
14:31:53.006 DEBUG> Closing connection (139961297252176) to 127.76.173.6:9042
14:31:53.007 INFO> Driver not connected to 127.76.173.27:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.21:9042 yet
14:31:53.007 INFO> Driver not connected to 127.76.173.6:9042 yet

...

14:32:46.100 INFO> Driver not connected to 127.76.173.27:9042 yet
14:32:46.100 INFO> Driver not connected to 127.76.173.21:9042 yet
14:32:46.101 INFO> Driver not connected to 127.76.173.6:9042 yet
---------------------------- Captured log teardown -----------------------------
14:32:47.140 DEBUG> after_test for test_downgrade_after_successful_upgrade_fails (success: False)

The driver doesn't reconnect automatically within a minute. IPs of the restarted nodes are the same as before

kbr-scylla commented 1 year ago

I wonder if it's the same thing as https://github.com/scylladb/python-driver/issues/170 - which was supposedly fixed - or maybe it's some different kind of race which only happens when we restart all nodes?

kostja commented 11 months ago

Please we need some traction on this, it affects the stability of the tests.

mykaul commented 8 months ago

@avelanarius - ping

avelanarius commented 8 months ago

@sylwiaszunejko is currently working on this issue: https://github.com/scylladb/python-driver/issues/295 which seemed more current (before issues reported by @piodul today).

I'll try to get to this soon.

scylladb / python-driver

Session might not reconnect if all nodes are restarted at once #230