Cassandra request timeout error occurred during dropping the index

timtimb0t commented 1 day ago

Packages

Scylla version: 6.3.0~dev-20241108.aebb5329068e with build-id f25ba153fbf85f1e556539e48f980dd93e3ab285

Kernel Version: 6.8.0-1018-aws

Issue description

New issue

I'm not sure what the root cause of this problem is, but the following errors appeared right before the nemesis failure:

2024-11-09T08:45:13.983+00:00 longevity-twcs-48h-master-db-node-b7272755-2     !INFO | sudo[12485]: pam_unix(sudo:session): session closed for user root
2024-11-09T08:45:15.733+00:00 longevity-twcs-48h-master-db-node-b7272755-2     !INFO | amazon-ssm-agent[558]: 2024-11-09 08:45:15.3681 WARN EC2RoleProvider Failed to connect to Systems Manager with instance profile role credentials. Err: retrieved credentials failed to report to ssm. Error: AccessDeniedException: User: arn:aws:sts::797456418907:assumed-role/qa-scylla-manager-backup-role/i-0fbfb74538cfc43ab is not authorized to perform: ssm:UpdateInstanceInformation on resource: arn:aws:ec2:eu-west-1:797456418907:instance/i-0fbfb74538cfc43ab because no identity-based policy allows the ssm:UpdateInstanceInformation action
2024-11-09T08:45:15.733+00:00 longevity-twcs-48h-master-db-node-b7272755-2     !INFO | amazon-ssm-agent[558]: 2024-11-09 08:45:15.3967 ERROR EC2RoleProvider Failed to connect to Systems Manager with SSM role credentials. error calling RequestManagedInstanceRoleToken: AccessDeniedException: Systems Manager's instance management role is not configured for account: 797456418907
2024-11-09T08:45:15.733+00:00 longevity-twcs-48h-master-db-node-b7272755-2     !INFO | amazon-ssm-agent[558]:   status code: 400, request id: bfbccf7d-d95e-405d-be1e-c8c72be2f736

Impact

No apparent impact on Scylla; this seems to be an SCT issue.

How frequently does it reproduce?

Describe the frequency with how this issue can be reproduced.

Installation details

Cluster size: 4 nodes (i3en.2xlarge)

Scylla Nodes used in this run:

longevity-twcs-48h-master-db-node-b7272755-6 (3.250.163.219 | 10.4.11.126) (shards: 7)
longevity-twcs-48h-master-db-node-b7272755-5 (3.253.74.59 | 10.4.9.224) (shards: 7)
longevity-twcs-48h-master-db-node-b7272755-4 (54.170.10.255 | 10.4.11.51) (shards: 7)
longevity-twcs-48h-master-db-node-b7272755-3 (54.154.162.240 | 10.4.10.44) (shards: 7)
longevity-twcs-48h-master-db-node-b7272755-2 (63.35.171.254 | 10.4.10.199) (shards: 7)
longevity-twcs-48h-master-db-node-b7272755-1 (34.247.85.227 | 10.4.11.77) (shards: 7)

OS / Image: ami-07f847bea92dccb9a (aws: undefined_region)

Test: longevity-twcs-48h-test
Test id: b7272755-2d70-4e84-8a05-7cb0559db73d
Test name: scylla-master/tier1/longevity-twcs-48h-test
Test method: longevity_twcs_test.TWCSLongevityTest.test_custom_time
Test config file(s):

longevity-twcs-48h.yaml

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor b7272755-2d70-4e84-8a05-7cb0559db73d`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=b7272755-2d70-4e84-8a05-7cb0559db73d)
- Show all stored logs command: `$ hydra investigate show-logs b7272755-2d70-4e84-8a05-7cb0559db73d`

## Logs:

- **longevity-twcs-48h-master-db-node-b7272755-1** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-1-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-1-b7272755.tar.gz)
- **longevity-twcs-48h-master-db-node-b7272755-2** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-2-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241109_040647/longevity-twcs-48h-master-db-node-b7272755-2-b7272755.tar.gz)
- **db-cluster-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/db-cluster-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/db-cluster-b7272755.tar.gz)
- **sct-runner-events-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-runner-events-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-runner-events-b7272755.tar.gz)
- **sct-b7272755.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-b7272755.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/sct-b7272755.log.tar.gz)
- **loader-set-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/loader-set-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/loader-set-b7272755.tar.gz)
- **monitor-set-b7272755.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/monitor-set-b7272755.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/b7272755-2d70-4e84-8a05-7cb0559db73d/20241110_041622/monitor-set-b7272755.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/tier1/job/longevity-twcs-48h-test/43/)
[Argus](https://argus.scylladb.com/test/ecd497c0-82d6-4269-b053-f5c2157e04ae/runs?additionalRuns[]=b7272755-2d70-4e84-8a05-7cb0559db73d)

fruch commented 22 hours ago

@timtimb0t

The logs you referred to are irrelevant; it's a constant error we get since the node doesn't have AWS credentials, and it's o.k.

This is the relevant information:

2024-11-09 08:48:25.137: (DisruptionEvent Severity.ERROR) period_type=end event_id=0599554f-90cb-4cca-be75-25060b00ec34 duration=3h23m27s: nemesis_name=CreateIndex target_node=Node longevity-twcs-48h-master-db-node-b7272755-2 [63.35.171.254 | 10.4.10.199] errors=errors={'10.4.11.51:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.11.51:9042
Traceback (most recent call last):
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 5354, in wrapper
    result = method(*args[1:], **kwargs)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/nemesis.py", line 4999, in disrupt_create_index
    drop_index(session, ks, index_name)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/nemesis_utils/indexes.py", line 116, in drop_index
    session.execute(SimpleStatement(f'DROP INDEX {ks}.{index_name}'), timeout=300)
  File "/home/ubuntu/scylla-cluster-tests/sdcm/utils/common.py", line 1318, in execute_verbose
    return execute_orig(*args, **kwargs)
  File "cassandra/cluster.py", line 2729, in cassandra.cluster.Session.execute
  File "cassandra/cluster.py", line 5120, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.4.11.51:9042': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.4.11.51:9042
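
For context, `OperationTimedOut` in the traceback is a client-side timeout: the driver gave up waiting after 300 seconds, which by itself does not say whether the server eventually completed the `DROP INDEX`. A minimal sketch of how a caller could tolerate that is below. It is not what SCT does today; `drop_index_tolerating_timeout`, `execute`, and `index_exists` are hypothetical names, and the callables are injected so the sketch runs without a cluster.

```python
import time


def drop_index_tolerating_timeout(execute, index_exists, ks, index_name,
                                  timeout_errors=(TimeoutError,),
                                  attempts=3, wait_s=0.0):
    """Hypothetical helper (not part of SCT or the driver).

    Issues DROP INDEX via the injected `execute` callable. On a client-side
    timeout it asks `index_exists` whether the index is already gone before
    retrying. Returns the number of attempts that were made.
    """
    for attempt in range(1, attempts + 1):
        try:
            # IF EXISTS makes a retry harmless if an earlier attempt
            # actually completed on the server side.
            execute(f"DROP INDEX IF EXISTS {ks}.{index_name}")
            return attempt
        except timeout_errors:
            # The client timed out, but the schema change may still have
            # been applied, so check before trying again.
            if not index_exists(ks, index_name):
                return attempt
            time.sleep(wait_s)
    raise TimeoutError(
        f"DROP INDEX {ks}.{index_name} still pending after {attempts} attempts"
    )
```

With the Python driver, one would presumably pass `timeout_errors=(cassandra.OperationTimedOut,)` and an `execute` that wraps `session.execute(SimpleStatement(query), timeout=300)`; how `index_exists` is implemented (for example, by querying schema metadata) is left open here.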

The nemesis ran for more than 3h: from 2024-11-09 05:24:57 to 2024-11-09 08:48:25.
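
As a quick sanity check, the two timestamps can be subtracted directly. The result is within a second of the `duration=3h23m27s` reported in the event, presumably because the printed timestamps are truncated to whole seconds (that explanation is an assumption).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# Start and end of the CreateIndex nemesis, as quoted above.
start = datetime.strptime("2024-11-09 05:24:57", FMT)
end = datetime.strptime("2024-11-09 08:48:25", FMT)

elapsed = end - start
print(elapsed)  # 3:23:28
```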

And the load dropped sharply while it was running:

fruch commented 22 hours ago

There are multiple reports of this in the Scylla issue: https://github.com/scylladb/scylladb/issues/16661

I.e., it sounds like this is not something new, and I'm not sure SCT can do anything about it.
