
MgmtRepair nemeses failed with cause: get repair target: get cluster views: gocql: no response received from cassandra within timeout period #3612

Open temichus opened 11 months ago

temichus commented 11 months ago

Issue description

The MgmtRepair nemesis failed too quickly, with the error: `get repair target: get cluster views: gocql: no response received from cassandra within timeout period`
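
For context, the timeout in that message is the gocql client-side request timeout, not a server-side repair timeout. A minimal sketch (not Manager code; the contact point and the 15s value are just illustrative) of where that knob lives on the driver side:

```go
package main

import (
	"log"
	"time"

	"github.com/gocql/gocql"
)

func main() {
	// Hypothetical contact point; any reachable Scylla node would do.
	cluster := gocql.NewCluster("10.0.0.1")
	cluster.Keyspace = "system"

	// Raising the per-request timeout gives an overloaded node more time to
	// answer before gocql gives up with "no response received from cassandra
	// within timeout period".
	cluster.Timeout = 15 * time.Second

	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatalf("create session: %v", err)
	}
	defer session.Close()
}
```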

The next nemesis run finished too quickly and looks like a false positive:

`disrupt_mgmt_repair_cli longevity-5gb-1h-MgmtRepair-master-db-node-5681c472-1 Succeeded 2023-10-18 00:00:22 2023-10-18 00:48:42`

The subsequent runs failed with a different error:

`Cause: another task is running`

Impact

The repair task failed.

How frequently does it reproduce?

This is an intermittent problem.

Installation details

Kernel Version: 5.15.0-1047-aws

Scylla version (or git commit hash): 5.4.0~dev-20231006.498e3ec435be with build-id 16c6112202348a8adba536b4195d48adfdf958f9

Cluster size: 3 nodes (i4i.large)

Scylla Nodes used in this run:

OS / Image: ami-057fff75e186f7fa9 (aws: undefined_region)

Test: longevity-5gb-1h-MgmtRepair-aws-test

Test id: 5681c472-7f68-44fc-91bc-5ae9bd4feec1

Test name: scylla-master/nemesis/longevity-5gb-1h-MgmtRepair-aws-test

Test config file(s):

Logs and commands

- Restore Monitor Stack command: `$ hydra investigate show-monitor 5681c472-7f68-44fc-91bc-5ae9bd4feec1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5681c472-7f68-44fc-91bc-5ae9bd4feec1)
- Show all stored logs command: `$ hydra investigate show-logs 5681c472-7f68-44fc-91bc-5ae9bd4feec1`

Logs:

- **db-cluster-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/db-cluster-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/db-cluster-5681c472.tar.gz)
- **sct-runner-events-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-runner-events-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-runner-events-5681c472.tar.gz)
- **sct-5681c472.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-5681c472.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-5681c472.log.tar.gz)
- **loader-set-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/loader-set-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/loader-set-5681c472.tar.gz)
- **monitor-set-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/monitor-set-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/monitor-set-5681c472.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/nemesis/job/longevity-5gb-1h-MgmtRepair-aws-test/56/)
[Argus](https://argus.scylladb.com/test/3d6d82db-db9b-4d0b-a043-afcd729f5be3/runs?additionalRuns[]=5681c472-7f68-44fc-91bc-5ae9bd4feec1)
enaydanov commented 11 months ago

It looks like the root cause is that the cluster is under a lot of stress (@mykaul said: "The MV workload is killing the tests"), but from my point of view, ScyllaDB Manager needs to handle such timeouts in a more intelligent way.

See https://github.com/scylladb/scylladb/issues/15761 and https://github.com/scylladb/scylladb/issues/15717
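
A hedged sketch of the kind of retry such handling could use, assuming the failing call is an ordinary gocql read; the function name and retry parameters below are illustrative, not Manager's actual code. `gocql.ErrTimeoutNoResponse` is the exact error quoted in this issue's title:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/gocql/gocql"
)

// withTimeoutRetry re-runs fn when it fails with gocql's client-side timeout,
// backing off exponentially; any other error is returned immediately so real
// failures are not masked.
func withTimeoutRetry(ctx context.Context, attempts int, fn func() error) error {
	backoff := 500 * time.Millisecond
	var err error
	for i := 0; i < attempts; i++ {
		if err = fn(); err == nil {
			return nil
		}
		// ErrTimeoutNoResponse is "gocql: no response received from cassandra
		// within timeout period" - the error this issue is about.
		if !errors.Is(err, gocql.ErrTimeoutNoResponse) {
			return err
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
			backoff *= 2
		}
	}
	return err
}

func main() {
	// Demo with a stub that always times out; in Manager this would wrap the
	// read behind "get cluster views".
	err := withTimeoutRetry(context.Background(), 3, func() error {
		return gocql.ErrTimeoutNoResponse
	})
	fmt.Println(err)
}
```

Even a bounded retry like this would turn a transient overload into a slower-but-successful repair start instead of an immediate task failure.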