scylladb / scylladb

NoSQL data store using the seastar framework, compatible with Apache Cassandra
http://scylladb.com
GNU Affero General Public License v3.0

Client request timeout on DROP INDEX during disrupt_create_index nemesis #16661

Open temichus opened 10 months ago

temichus commented 10 months ago

Issue description

error

2023-12-30 18:28:36.108: (DisruptionEvent Severity.ERROR) period_type=end event_id=833631fc-6aaf-40c9-9645-b98d2fb791cc duration=3h25m57s: nemesis_name=CreateIndex target_node=Node longevity-twcs-48h-master-db-node-c08ea734-4 [18.201.157.158 | 10.4.10.227] (seed: True) errors=errors={'10.4.9.249:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.9.249:9042
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5063, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4730, in disrupt_create_index
    drop_index(session, ks, index_name)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/nemesis_utils/indexes.py", line 122, in drop_index
    session.execute(f'DROP INDEX {ks}.{index_name}')
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1749, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2699, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5018, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.4.9.249:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.9.249:9042

The timeout occurs during the disrupt_create_index nemesis, when the test issues DROP INDEX for the index it created.
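The driver message in the traceback points at the `timeout` argument of `Session.execute`. As a minimal sketch of how a caller could pass a longer per-request timeout and retry on `OperationTimedOut`, assuming the Python cassandra-driver: the helper name `drop_index_with_retry` and the `timeout`, `attempts`, and `backoff` values are hypothetical and are not taken from scylla-cluster-tests. This only illustrates the client side and says nothing about why the server was slow to respond.

```python
import time

try:
    # Real driver exception, the one raised in the traceback above.
    from cassandra import OperationTimedOut
except ImportError:
    # Stand-in so the sketch still runs where the driver is not installed.
    class OperationTimedOut(Exception):
        pass


def drop_index_with_retry(session, ks, index_name,
                          timeout=300.0, attempts=3, backoff=5.0):
    """Hypothetical helper: DROP INDEX with a longer client timeout and retries.

    `session` is expected to behave like cassandra.cluster.Session, i.e. to
    accept execute(query, timeout=<seconds>).
    """
    # IF EXISTS keeps a retry safe when an earlier attempt timed out on the
    # client but the drop still completed on the server.
    query = f"DROP INDEX IF EXISTS {ks}.{index_name}"
    for attempt in range(1, attempts + 1):
        try:
            return session.execute(query, timeout=timeout)
        except OperationTimedOut:
            if attempt == attempts:
                raise  # out of attempts, surface the original error
            time.sleep(backoff)
```

With a real session this would be called as `drop_index_with_retry(session, ks, index_name)`. A longer timeout only hides the symptom if the schema change itself is stuck, so the node logs for the same time window are still needed.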

Impact

Describe the impact this issue causes to the user.

How frequently does it reproduce?

Seen only once so far.

Installation details

Kernel Version: 5.15.0-1051-aws

Scylla version (or git commit hash): 5.5.0~dev-20231227.331d9ce788e2 with build-id 5a3ba5068a1b94097fb0f3fab64cdb912cff2911

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-0417c7525e0d98293 (aws: undefined_region)

Test: longevity-twcs-48h-test

Test id: c08ea734-1c32-43b7-b1d5-fec05b22887d

Test name: scylla-master/longevity/longevity-twcs-48h-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor c08ea734-1c32-43b7-b1d5-fec05b22887d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=c08ea734-1c32-43b7-b1d5-fec05b22887d)
- Show all stored logs command: `$ hydra investigate show-logs c08ea734-1c32-43b7-b1d5-fec05b22887d`

## Logs:

- **db-cluster-c08ea734.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/db-cluster-c08ea734.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/db-cluster-c08ea734.tar.gz)
- **sct-runner-events-c08ea734.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/sct-runner-events-c08ea734.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/sct-runner-events-c08ea734.tar.gz)
- **sct-c08ea734.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/sct-c08ea734.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/sct-c08ea734.log.tar.gz)
- **loader-set-c08ea734.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/loader-set-c08ea734.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/loader-set-c08ea734.tar.gz)
- **monitor-set-c08ea734.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/monitor-set-c08ea734.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/c08ea734-1c32-43b7-b1d5-fec05b22887d/20231231_050721/monitor-set-c08ea734.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/longevity/job/longevity-twcs-48h-test/101/)

[Argus](https://argus.scylladb.com/test/ca1ac895-78c0-4abd-80d5-0ba6dc7844bf/runs?additionalRuns[]=c08ea734-1c32-43b7-b1d5-fec05b22887d)

mykaul commented 10 months ago

What do we see in the node logs and metrics during that time?

timtimb0t commented 1 week ago

The same error reproduced in this Argus run:

Packages

Scylla version: 6.3.0~dev-20241108.aebb5329068e with build-id f25ba153fbf85f1e556539e48f980dd93e3ab285

Kernel Version: 6.8.0-1018-aws

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

OS / Image: ami-07f847bea92dccb9a (aws: undefined_region)

Test: longevity-twcs-48h-test

Test id: b7272755-2d70-4e84-8a05-7cb0559db73d

Test name: scylla-master/tier1/longevity-twcs-48h-test

Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor b7272755-2d70-4e84-8a05-7cb0559db73d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=b7272755-2d70-4e84-8a05-7cb0559db73d)
- Show all stored logs command: `$ hydra investigate show-logs b7272755-2d70-4e84-8a05-7cb0559db73d`

## Logs:

- **longevity-twcs-48h-master-db-node-b7272755-1** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-1-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-1-b7272755.tar.gz)
- **longevity-twcs-48h-master-db-node-b7272755-2** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-2-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-2-b7272755.tar.gz)
- **db-cluster-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/db-cluster-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/db-cluster-b7272755.tar.gz)
- **sct-runner-events-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-runner-events-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-runner-events-b7272755.tar.gz)
- **sct-b7272755.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-b7272755.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-b7272755.log.tar.gz)
- **loader-set-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/loader-set-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/loader-set-b7272755.tar.gz)
- **monitor-set-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/monitor-set-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/monitor-set-b7272755.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-twcs-48h-test/43/)

[Argus](https://argus.scylladb.com/test/ecd497c0-82d6-4269-b053-f5c2157e04ae/runs?additionalRuns[]=b7272755-2d70-4e84-8a05-7cb0559db73d)

Israel investigated it there.