scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0
13.65k stars 1.3k forks source link

[x86_64, dev] topology_experimental_raft/test_topology_ops failed with FAILED #19969

Open scylladb-promoter opened 4 months ago

scylladb-promoter commented 4 months ago

https://jenkins.scylladb.com/job/scylla-master/job/next/8019/ failed with the following error:

tchaikov commented 4 months ago
[2024-07-31T20:42:26.737Z] The following test(s) have failed: topology_experimental_raft/test_topology_ops
[2024-07-31T20:42:26.737Z] Output of pytest --artifacts_dir_url=https://jenkins.scylladb.com/job/scylla-master/job/next/8019/artifact --manager-api=/tmp/manager-giqx3c1m/api --mode=dev --tmpdir=/scylladir/testlog/x86_64 --run_id=3 -s --log-level=DEBUG -o junit_family=xunit2 -o junit_suite_name=topology_experimental_raft --junit-xml=/scylladir/testlog/x86_64/dev/xml/topology_experimental_raft.test_topology_ops.3.xunit.xml -rs --run_id=3 --mode=dev test/topology_experimental_raft/test_topology_ops.py:
[2024-07-31T20:42:26.737Z] === TEST.PY STARTING TEST topology_experimental_raft.test_topology_ops.3 ===
..
[2024-07-31T20:42:26.850Z] 17:02:42.029 INFO> Stopping node Server(4274, 127.40.51.28, 127.40.51.28)
[2024-07-31T20:42:26.850Z] 17:02:42.029 DEBUG> Connection 140702590876416: '('127.0.0.1', 47010)' -> '('127.40.51.28', 9042)'
[2024-07-31T20:42:26.850Z] 17:02:42.029 DEBUG> Host 127.40.51.28:9042 is now marked up
[2024-07-31T20:42:26.851Z] 17:02:42.029 DEBUG> ManagerClient stopping gracefully 4274
..
[2024-07-31T20:42:26.854Z] 17:02:44.154 INFO> Replacing node Server(4274, 127.40.51.28, 127.40.51.28)
..
[2024-07-31T20:42:26.854Z] 17:02:48.525 DEBUG> ManagerClient added Server(4279, 127.40.51.14, 127.40.51.14)
..
[2024-07-31T20:42:26.857Z] 17:06:56.041 WARNING> Heartbeat failed for connection (140702565127744) to 127.40.51.46:9042
[2024-07-31T20:42:26.857Z] 17:06:56.041 WARNING> Heartbeat failed for connection (140702565079120) to 127.40.51.46:9042
[2024-07-31T20:42:26.857Z] 17:06:56.041 WARNING> Heartbeat failed for connection (140702565123328) to 127.40.51.52:9042
[2024-07-31T20:42:26.857Z] 17:06:56.041 WARNING> Heartbeat failed for connection (140702565085312) to 127.40.51.40:9042
[2024-07-31T20:42:26.858Z] 17:06:56.041 WARNING> Heartbeat failed for connection (140702565050912) to 127.40.51.40:9042
[2024-07-31T20:42:26.858Z] 17:06:56.041 WARNING> Heartbeat failed for connection (140702677151920) to 127.40.51.46:9042
...
[2024-07-31T20:42:26.860Z] 17:06:56.042 DEBUG> Shutting down connections to 127.40.51.46:9042
[2024-07-31T20:42:26.860Z] 17:06:56.042 DEBUG> Defuncting connection (140702565123328) to 127.40.51.52:9042: errors=Connection heartbeat timeout after -0.0005295276641845703 seconds, last_host=127.40.51.52:9042
[2024-07-31T20:42:26.860Z] 17:06:56.042 DEBUG> Closing connection (140702565123328) to 127.40.51.52:9042
[2024-07-31T20:42:26.860Z] 17:06:56.042 DEBUG> Closed socket to 127.40.51.52:9042
[2024-07-31T20:42:26.860Z] 17:06:56.042 DEBUG> Defunct or closed connection (140702565123328) returned to pool, potentially marking host 127.40.51.52:9042 as down
..
[2024-07-31T20:42:26.862Z] 17:06:56.043 WARNING> Host 127.40.51.40:9042 has been marked down
..
[2024-07-31T20:42:26.863Z] 17:06:56.053 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.863Z] 17:06:56.053 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.863Z] 17:07:26.054 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.864Z] 17:07:26.054 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.864Z] 17:07:56.055 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.864Z] 17:07:56.055 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.864Z] 17:08:26.056 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.864Z] 17:08:26.056 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.864Z] 17:08:56.056 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.864Z] 17:08:56.057 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.865Z] 17:09:26.057 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.865Z] 17:09:26.057 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.865Z] 17:09:28.526 DEBUG> Closing connection (140702563559632) to 127.40.51.46:9042
[2024-07-31T20:42:26.865Z] 17:09:28.527 DEBUG> Closed socket to 127.40.51.46:9042
[2024-07-31T20:42:26.865Z] 17:09:28.527 WARNING> [control connection] Error connecting to 127.40.51.46:9042:
[2024-07-31T20:42:26.865Z] Traceback (most recent call last):
[2024-07-31T20:42:26.865Z]   File "cassandra/cluster.py", line 3629, in cassandra.cluster.ControlConnection._connect_host_in_lbp
[2024-07-31T20:42:26.865Z]   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
[2024-07-31T20:42:26.865Z]   File "cassandra/cluster.py", line 1700, in cassandra.cluster.Cluster.connection_factory
[2024-07-31T20:42:26.866Z]   File "cassandra/connection.py", line 889, in cassandra.connection.Connection.factory
[2024-07-31T20:42:26.866Z] cassandra.OperationTimedOut: errors=Timed out creating connection (200 seconds), last_host=None
..
[2024-07-31T20:42:26.868Z] 17:12:48.528 WARNING> [control connection] Error connecting to 127.40.51.52:9042:
[2024-07-31T20:42:26.868Z] Traceback (most recent call last):
[2024-07-31T20:42:26.868Z]   File "cassandra/cluster.py", line 3629, in cassandra.cluster.ControlConnection._connect_host_in_lbp
[2024-07-31T20:42:26.868Z]   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
[2024-07-31T20:42:26.868Z]   File "cassandra/cluster.py", line 1700, in cassandra.cluster.Cluster.connection_factory
[2024-07-31T20:42:26.868Z]   File "cassandra/connection.py", line 889, in cassandra.connection.Connection.factory
[2024-07-31T20:42:26.868Z] cassandra.OperationTimedOut: errors=Timed out creating connection (200 seconds), last_host=None
..
[2024-07-31T20:42:26.871Z] 17:16:08.528 WARNING> [control connection] Error connecting to 127.40.51.40:9042:
[2024-07-31T20:42:26.871Z] Traceback (most recent call last):
[2024-07-31T20:42:26.871Z]   File "cassandra/cluster.py", line 3629, in cassandra.cluster.ControlConnection._connect_host_in_lbp
[2024-07-31T20:42:26.871Z]   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
[2024-07-31T20:42:26.872Z]   File "cassandra/cluster.py", line 1700, in cassandra.cluster.Cluster.connection_factory
[2024-07-31T20:42:26.872Z]   File "cassandra/connection.py", line 889, in cassandra.connection.Connection.factory
..
[2024-07-31T20:42:26.879Z] 17:22:48.531 DEBUG> Sending initial options message for new connection (140702564133840) to 127.40.51.40:9042
[2024-07-31T20:42:26.880Z] 17:22:56.079 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.880Z] 17:22:56.079 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.880Z] 17:23:26.079 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.880Z] 17:23:26.079 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.880Z] 17:23:56.080 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.880Z] 17:23:56.080 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.880Z] 17:24:26.081 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.880Z] 17:24:26.081 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.881Z] 17:24:56.081 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.881Z] 17:24:56.082 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.881Z] 17:25:26.082 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.881Z] 17:25:26.083 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.881Z] 17:25:56.083 DEBUG> Cannot send heartbeat message on connection (140702563761904) to 127.40.51.28:9042
[2024-07-31T20:42:26.881Z] 17:25:56.084 DEBUG> Cannot send heartbeat message on connection (140702677151920) to 127.40.51.46:9042
[2024-07-31T20:42:26.881Z] 17:26:08.532 DEBUG> Closing connection (140702564133840) to 127.40.51.40:9042
[2024-07-31T20:42:26.882Z] 17:26:08.532 DEBUG> Closed socket to 127.40.51.40:9042
[2024-07-31T20:42:26.882Z] 17:26:08.532 WARNING> [control connection] Error connecting to 127.40.51.40:9042:
[2024-07-31T20:42:26.882Z] Traceback (most recent call last):
[2024-07-31T20:42:26.882Z]   File "cassandra/cluster.py", line 3629, in cassandra.cluster.ControlConnection._connect_host_in_lbp
[2024-07-31T20:42:26.882Z]   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
[2024-07-31T20:42:26.882Z]   File "cassandra/cluster.py", line 1700, in cassandra.cluster.Cluster.connection_factory
[2024-07-31T20:42:26.882Z]   File "cassandra/connection.py", line 889, in cassandra.connection.Connection.factory
[2024-07-31T20:42:26.882Z] cassandra.OperationTimedOut: errors=Timed out creating connection (200 seconds), last_host=None
[2024-07-31T20:42:26.882Z] 17:26:08.532 DEBUG> [control connection] Opening new connection to 127.40.51.28:9042
[2024-07-31T20:42:26.882Z] 17:26:08.533 WARNING> [control connection] Error connecting to 127.40.51.28:9042:
[2024-07-31T20:42:26.883Z] Traceback (most recent call last):
[2024-07-31T20:42:26.883Z]   File "cassandra/cluster.py", line 3629, in cassandra.cluster.ControlConnection._connect_host_in_lbp
[2024-07-31T20:42:26.883Z]   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
[2024-07-31T20:42:26.883Z]   File "cassandra/cluster.py", line 1700, in cassandra.cluster.Cluster.connection_factory
[2024-07-31T20:42:26.883Z]   File "cassandra/connection.py", line 876, in cassandra.connection.Connection.factory
[2024-07-31T20:42:26.883Z]   File "/usr/local/lib64/python3.12/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
[2024-07-31T20:42:26.883Z]     self._connect_socket()
[2024-07-31T20:42:26.883Z]   File "cassandra/connection.py", line 959, in cassandra.connection.Connection._connect_socket
[2024-07-31T20:42:26.883Z] ConnectionRefusedError: [Errno 111] Tried connecting to [('127.40.51.28', 9042)]. Last error: Connection refused
[2024-07-31T20:42:26.884Z] 17:26:08.533 DEBUG> [control connection] Opening new connection to 127.40.51.50:9042
[2024-07-31T20:42:26.884Z] 17:26:08.533 WARNING> [control connection] Error connecting to 127.40.51.50:9042:
[2024-07-31T20:42:26.884Z] Traceback (most recent call last):
[2024-07-31T20:42:26.884Z]   File "cassandra/cluster.py", line 3629, in cassandra.cluster.ControlConnection._connect_host_in_lbp
[2024-07-31T20:42:26.884Z]   File "cassandra/cluster.py", line 3675, in cassandra.cluster.ControlConnection._try_connect
[2024-07-31T20:42:26.884Z]   File "cassandra/cluster.py", line 1700, in cassandra.cluster.Cluster.connection_factory
[2024-07-31T20:42:26.884Z]   File "cassandra/connection.py", line 876, in cassandra.connection.Connection.factory
[2024-07-31T20:42:26.884Z]   File "/usr/local/lib64/python3.12/site-packages/cassandra/io/libevreactor.py", line 267, in __init__
[2024-07-31T20:42:26.885Z]     self._connect_socket()
[2024-07-31T20:42:26.885Z]   File "cassandra/connection.py", line 959, in cassandra.connection.Connection._connect_socket
[2024-07-31T20:42:26.885Z] ConnectionRefusedError: [Errno 111] Tried connecting to [('127.40.51.50', 9042)]. Last error: Connection refused
[2024-07-31T20:42:26.885Z] 17:26:08.534 DEBUG> Removed connection pool for <Host: 127.40.51.46:9042 datacenter1>
[2024-07-31T20:42:26.885Z] 17:26:08.534 DEBUG> Starting reconnector for host 127.40.51.46:9042
[2024-07-31T20:42:26.885Z] 17:26:08.534 DEBUG> Removed connection pool for <Host: 127.40.51.52:9042 datacenter1>
[2024-07-31T20:42:26.885Z] 17:26:08.534 DEBUG> Starting reconnector for host 127.40.51.52:9042
[2024-07-31T20:42:26.885Z] 17:26:08.534 ERROR> Write started 1406.5036611557007s ago failed: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.40.51.52:9042
[2024-07-31T20:42:26.886Z] 
[2024-07-31T20:42:26.886Z] test/topology_experimental_raft/test_topology_ops.py::test_topology_ops[True].dev.3 FAILED
tchaikov commented 4 months ago
[2024-07-31T20:42:26.885Z] 17:26:08.534 ERROR> Write started 1406.5036611557007s ago failed: errors={'Connection defunct by heartbeat': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=127.40.51.52:9042

a write op timed out after we replaced 127.40.51.28 with 127.40.51.14. it just lost the access to the cluster.

mykaul commented 1 month ago

Dup of https://github.com/scylladb/scylladb/issues/20874 (less logs here)

mykaul commented 1 month ago

And fix is hopefully https://github.com/scylladb/scylladb/pull/20977