Open temichus opened 11 months ago
It looks like the root cause is that the cluster is under a lot of stress (@mykaul said: "The MV workload is killing the tests"), but from my point of view, ScyllaDB Manager needs to handle such timeouts in a more intelligent way.
See https://github.com/scylladb/scylladb/issues/15761 and https://github.com/scylladb/scylladb/issues/15717
Issue description
The MgmtRepair nemesis failed too quickly with the error:
get repair target: get cluster views: gocql: no response received from Cassandra within timeout period
The next nemesis run finished too quickly and looks like a false positive:
The following runs failed with a different error:
Cause: another task is running
Impact
The repair task failed.
How frequently does it reproduce?
This is an intermittent problem.
Installation details
Kernel Version: 5.15.0-1047-aws
Scylla version (or git commit hash): 5.4.0~dev-20231006.498e3ec435be with build-id 16c6112202348a8adba536b4195d48adfdf958f9
Cluster size: 3 nodes (i4i.large)
Scylla Nodes used in this run:
OS / Image: ami-057fff75e186f7fa9 (aws: undefined_region)
Test: longevity-5gb-1h-MgmtRepair-aws-test
Test id: 5681c472-7f68-44fc-91bc-5ae9bd4feec1
Test name: scylla-master/nemesis/longevity-5gb-1h-MgmtRepair-aws-test
Test config file(s):
Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor 5681c472-7f68-44fc-91bc-5ae9bd4feec1`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=5681c472-7f68-44fc-91bc-5ae9bd4feec1)
- Show all stored logs command: `$ hydra investigate show-logs 5681c472-7f68-44fc-91bc-5ae9bd4feec1`

## Logs:

- **db-cluster-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/db-cluster-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/db-cluster-5681c472.tar.gz)
- **sct-runner-events-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-runner-events-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-runner-events-5681c472.tar.gz)
- **sct-5681c472.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-5681c472.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/sct-5681c472.log.tar.gz)
- **loader-set-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/loader-set-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/loader-set-5681c472.tar.gz)
- **monitor-set-5681c472.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/monitor-set-5681c472.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/5681c472-7f68-44fc-91bc-5ae9bd4feec1/20231018_085832/monitor-set-5681c472.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-master/job/nemesis/job/longevity-5gb-1h-MgmtRepair-aws-test/56/)
[Argus](https://argus.scylladb.com/test/3d6d82db-db9b-4d0b-a043-afcd729f5be3/runs?additionalRuns[]=5681c472-7f68-44fc-91bc-5ae9bd4feec1)