scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com

Multiple `Reactor stalled` and `Exception when communicating` errors after c-s threads have started #11950

Status: Closed (ilya-rarov closed this issue 1 year ago)

ilya-rarov commented 1 year ago

Installation details

Kernel Version: 5.15.0-1020-aws
Scylla version (or git commit hash): 2022.1.3-20220922.539a55e35 with build-id d1fb2faafd95058a04aad30b675ff7d2b930278d
Relocatable Package: http://downloads.scylladb.com/unstable/scylla-enterprise/enterprise-2022.1/relocatable/2022-09-22T13:36:03Z/scylla-enterprise-x86_64-package.tar.gz
Cluster size: 6 nodes (i3.2xlarge)

Scylla Nodes used in this run: No resources left at the end of the run

OS / Image: ami-0f6aebcffc8f7aa66 (aws: eu-west-1)

Test: scylla-cloud-longevity-cdc-100gb-4h
Test id: 31c49a40-cd31-4214-b469-54011c1aaaf0
Test name: siren-tests/longevity-tests/scylla-cloud-longevity-cdc-100gb-4h
Test config file(s):

Issue description

By 2022-11-08 16:01:55.317 all the c-s (cassandra-stress) threads had been started.

Then we ran a series of nodetool cfstats and nodetool flush commands (no nemeses had been started at that point). cfstats for the audit keyspace (+ flush) succeeded:

< t:2022-11-08 16:02:06,231 f:cluster.py      l:4099 c:sdcm.cluster         p:DEBUG > <cluster_cloud.ScyllaCloudCluster object at 0x7fc8d6fb6ef0>: Get cfstats on the node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 for audit keyspace
< t:2022-11-08 16:02:06,231 f:events_processes.py l:147  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20221108-150617-417068,id=0x7fc8d70747c0,default=True]
< t:2022-11-08 16:02:06,232 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u scylla -pw '***'  flush "...
< t:2022-11-08 16:02:06,233 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-11-08 16:02:06.231: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=262ff7fc-358c-40f3-9258-fc25c0d788db: nodetool_command=flush node=longevity-cdc-100gb-4h-master-db-node-31c49a40-1
< t:2022-11-08 16:02:06,233 f:grafana.py      l:80   c:sdcm.sct_events.grafana p:DEBUG > GrafanaEventAggregator start a new time window (90 sec)
< t:2022-11-08 16:02:08,275 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u scylla -pw '***'  flush " finished with status 0
< t:2022-11-08 16:02:08,276 f:cluster.py      l:2567 c:sdcm.cluster_baremetal p:DEBUG > Node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 [None | 172.23.40.169] (seed: True): Command '/usr/bin/nodetool -u scylla -pw '***'  flush ' duration -> 2.0436472590008634 s
< t:2022-11-08 16:02:12,556 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u scylla -pw '***'  cfstats audit" finished with status 0

Then cfstats for the cdc_test keyspace (+ flush). The flush succeeded:

< t:2022-11-08 16:02:12,556 f:cluster.py      l:4099 c:sdcm.cluster         p:DEBUG > <cluster_cloud.ScyllaCloudCluster object at 0x7fc8d6fb6ef0>: Get cfstats on the node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 for cdc_test keyspace
< t:2022-11-08 16:02:12,556 f:events_processes.py l:147  c:sdcm.sct_events.events_processes p:DEBUG > Get process `MainDevice' from EventsProcessesRegistry[lod_dir=/home/ubuntu/sct-results/20221108-150617-417068,id=0x7fc8d70747c0,default=True]
< t:2022-11-08 16:02:12,556 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "/usr/bin/nodetool -u scylla -pw '***'  flush "...
< t:2022-11-08 16:02:12,558 f:file_logger.py  l:101  c:sdcm.sct_events.file_logger p:INFO  > 2022-11-08 16:02:12.556: (NodetoolEvent Severity.NORMAL) period_type=begin event_id=f134e547-e49e-4374-bcfa-4007108cb0a9: nodetool_command=flush node=longevity-cdc-100gb-4h-master-db-node-31c49a40-1
< t:2022-11-08 16:02:14,060 f:base.py         l:142  c:RemoteLibSSH2CmdRunner p:DEBUG > Command "/usr/bin/nodetool -u scylla -pw '***'  flush " finished with status 0
< t:2022-11-08 16:02:14,060 f:cluster.py      l:2567 c:sdcm.cluster_baremetal p:DEBUG > Node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 [None | 172.23.40.169] (seed: True): Command '/usr/bin/nodetool -u scylla -pw '***'  flush ' duration -> 1.5034743630003504 s
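For each keyspace the test repeats the same two-step pattern over SSH. A rough sketch of that loop (the node.run helper and the db_node_1 handle are illustrative, not the actual sdcm API):

def check_keyspace(node, keyspace: str) -> None:
    # Roughly the sequence visible in the logs above: flush first,
    # then collect cfstats for the keyspace on the same node.
    node.run("/usr/bin/nodetool -u scylla -pw '***' flush")
    node.run(f"/usr/bin/nodetool -u scylla -pw '***' cfstats {keyspace}")

# db_node_1 stands in for a remote-command handle to node-1.
for keyspace in ('audit', 'cdc_test'):
    check_keyspace(db_node_1, keyspace)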

cfstats for cdc_test eventually failed with an SSH channel open timeout:

< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Command: "/usr/bin/nodetool -u scylla -pw '***'  cfstats cdc_test"
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Stdout:
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Stderr:
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > Exception:  File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 589, in run
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >     channel = self.open_channel()
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >   File "/home/ubuntu/scylla-cluster-tests/sdcm/remote/libssh2_client/__init__.py", line 687, in open_channel
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR >     raise OpenChannelTimeout(f'Failed to open channel in {timeout} seconds')
< t:2022-11-08 16:02:54,005 f:remote_libssh_cmd_runner.py l:66   c:RemoteLibSSH2CmdRunner p:ERROR > 
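For context, the OpenChannelTimeout above is raised when SCT's libssh2-based runner cannot open a new SSH channel within its deadline. A minimal sketch of that kind of deadline loop, assuming a try_open_channel callable that returns None while the session is busy (both the callable and the retry interval are illustrative assumptions, not the actual sdcm code):

import time

class OpenChannelTimeout(Exception):
    """Raised when an SSH channel cannot be opened before the deadline."""

def open_channel_with_deadline(try_open_channel, timeout=30.0, retry_interval=0.5):
    # Poll until the session yields a channel or the deadline passes.
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        channel = try_open_channel()  # hypothetical: None while the session is busy
        if channel is not None:
            return channel
        time.sleep(retry_interval)
    raise OpenChannelTimeout(f'Failed to open channel in {timeout} seconds')

A timeout here usually means the node was too loaded to service the SSH session in time, which is consistent with the reactor stalls reported below.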

Approximately at this time we got `Reactor stalled` messages on several nodes.

Also, a lot of the following error messages were found in the node logs:

2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:02:41+00:00 ip-172-23-42-118      !ERR | scylla[9952]:  [shard 0] storage_proxy - Exception when communicating with 172.23.42.118, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
2022-11-08T16:03:00+00:00 ip-172-23-40-225      !ERR | scylla[9834]:  [shard 1] storage_proxy - Exception when communicating with 172.23.40.225, to read from cdc_test.test_table_preimage: std::runtime_error (sl:default_read_concurrency_sem: wait queue overload)
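The wait queue overload errors indicate that the read concurrency semaphore refused to admit new reads because too many were already queued behind it. A simplified asyncio model of that admission-control mechanism (the concurrency and queue limits are made-up numbers; this is not Scylla's actual implementation):

import asyncio

class WaitQueueOverload(RuntimeError):
    """Raised when too many reads are already queued for admission."""

class BoundedConcurrencySemaphore:
    def __init__(self, concurrency=100, max_queue=1000):
        # At most `concurrency` reads run at once; at most `max_queue`
        # reads may wait for a slot. Anything beyond that is rejected
        # immediately, which is the rejection the log lines above report.
        self._sem = asyncio.Semaphore(concurrency)
        self._max_queue = max_queue
        self._waiting = 0

    async def run(self, coro_factory):
        if self._waiting >= self._max_queue:
            raise WaitQueueOverload('wait queue overload')
        self._waiting += 1
        try:
            await self._sem.acquire()
        finally:
            self._waiting -= 1
        try:
            return await coro_factory()
        finally:
            self._sem.release()

Failing fast like this bounds memory and latency under overload instead of letting the wait queue grow without limit.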

Since the nodetool command failed, we triggered coredump generation for scylla on node-1:

< t:2022-11-08 16:04:16,054 f:cluster.py      l:2472 c:sdcm.cluster_baremetal p:INFO  > Node longevity-cdc-100gb-4h-master-db-node-31c49a40-1 [None | 172.23.40.169] (seed: True): Generate scylla core
< t:2022-11-08 16:04:16,055 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo pkill -f --signal 3 /usr/bin/scylla"...
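Signal 3 is SIGQUIT, whose default disposition is to terminate the process and write a core dump (subject to ulimit -c and the system's core_pattern). A sketch of what the command above amounts to (the helper name is illustrative):

import subprocess

def trigger_scylla_coredump() -> None:
    # SIGQUIT (signal 3) makes the kernel terminate scylla and dump core,
    # assuming core dumps are enabled on the node.
    subprocess.run(
        ['sudo', 'pkill', '-f', '--signal', '3', '/usr/bin/scylla'],
        check=True,
    )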

Coredump: node_1_scylla_coredump.txt

Logs:

Jenkins job URL

michoecho commented 1 year ago

Refs #8828?

avikivity commented 1 year ago

Closing as duplicate of #8828.