@roydahan, @fruch, @ShlomiBalalis ^
Ran the job with only the problematic nemesis, and it looks like the stress command with that cl=ONE config generates error SCT events only for the first attempt of the nemesis, lets the job run until tearDown, and then fails with a critical SCT event:
```
2023-09-12 18:23:30.494: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=29ef847c-52bc-4752-87ad-743d4da92727 during_nemesis=MgmtCorruptThenRepair duration=3h0m9s: node=Node longevity-encrypt-at-rest-200gb-6h--loader-node-f22364f2-1 [52.73.213.1 | 10.12.10.227] (seed: False)
stress_cmd=cassandra-stress read cl=ONE duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..200200300 -log interval=5
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
com.datastax.driver.core.exceptions.Transport
```
Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): `2023.1.1-20230906.f4633ec973b0` with build-id `b454e7a22f80cf71a33b2f39e47127225e8fbc13`
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: `ami-00ce399fc50357db3` (aws: undefined_region)
Test: vp-EaR-longevity-kms-200gb-6h-test
Test id: f22364f2-0e84-4e84-84ee-d3c3b89566a9
Test name: scylla-staging/valerii/vp-EaR-longevity-kms-200gb-6h-test
Test config file(s):
I'm not sure why the encryption-at-rest case uses cl=ONE for its main load; that is highly unusual for a main load, and one would need a very good reason to run SCT with a nemesis like that. (Maybe it was there for validation before the nemesis was introduced, and was forgotten.)
Also, based on @roydahan's comment, I think MgmtCorruptThenRepair was wrongly introduced into the limited group:
```python
class MgmtCorruptThenRepair(Nemesis):
    manager_operation = True
    disruptive = True
    kubernetes = True
    limited = True

    def disrupt(self):
        self.disrupt_mgmt_corrupt_then_repair()
```
@roydahan we should rename that group and document it clearly; this says nothing of importance to anyone:

```python
limited: bool = False  # flag that signal that nemesis are belong to limited set of nemesises
```
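For context, a minimal sketch of how boolean class attributes like these typically drive nemesis selection. This is illustrative only, not the real SCT code, and the helper name `subclasses_with_flags` is hypothetical:

```python
# Illustrative sketch, not SCT's actual implementation: boolean class
# attributes tag each nemesis, and the runner filters subclasses by those
# tags to build the pool it will actually execute.
class Nemesis:
    disruptive: bool = False
    manager_operation: bool = False
    limited: bool = False  # membership in the "limited" nemesis set

    def disrupt(self):
        raise NotImplementedError

    @classmethod
    def subclasses_with_flags(cls, **flags):
        """Return subclasses whose class attributes match all the given flags."""
        return [
            sub for sub in cls.__subclasses__()
            if all(getattr(sub, name, None) == value for name, value in flags.items())
        ]


class MgmtCorruptThenRepair(Nemesis):
    manager_operation = True
    disruptive = True
    limited = True  # the membership this thread questions

    def disrupt(self):
        ...  # would corrupt SStables, then trigger a Scylla Manager repair


# Everything tagged `limited=True` ends up in the "limited" pool:
limited_pool = Nemesis.subclasses_with_flags(limited=True)
print(limited_pool)  # [<class '__main__.MgmtCorruptThenRepair'>]
```

Note that nothing in a mechanism like this says what *qualifies* a nemesis for the flag, which is why the one-line comment quoted above is the only documentation there is.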
> @roydahan we should rename that group, and document it clearly
I definitely agree. It's a meaningless name.
> - If CorruptThenRepair isn't "limited", the same with manager cannot be limited.

I think the labels were copied from another manager nemesis, not from CorruptThenRepair; CorruptThenRepair is not in the limited group.

> - Too late for a name change, it's already a "brand".

Yeah, a bad one. It should be renamed and documented properly in the code, or even in its own markdown file if that's not enough.
> - If CorruptThenRepair isn't "limited", the same with manager cannot be limited.
>
> I think the labels were copied from another manager nemesis, not from CorruptThenRepair; CorruptThenRepair is not in the limited group.
The failure happens before the mgmt involvement. It happens right after the scylla service is started again: the validation error at 00:02:16 precedes the repair task created at 00:04:00:
```
2023-09-08 17:24:56,179 f:tester.py l:626 c:LongevityTest p:INFO > Test start time 2023-09-08 17:24:56, duration is 550 and timeout set to 2023-09-09 02:34:56
...
NEMESIS
disrupt_mgmt_corrupt_then_repair longevity-encrypt-at-rest-200gb-6h--db-node-ee769899-2 Skipped 2023-09-09 00:00:44 2023-09-09 00:24:35
...
2023-09-09 00:01:36,592 f:nemesis.py l:1053 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: SStables amount to destroy (50 percent of all SStables): 79
...
2023-09-09 00:01:37,814 f:nemesis.py l:1068 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: Files /var/lib/scylla/data/keyspace1/standard1-ce6a94f04e6c11eebcd13adf5291bd09/me-9512-big-Data.db were destroyed
...
2023-09-09 00:02:12,185 f:nemesis.py l:1068 c:sdcm.nemesis p:DEBUG > sdcm.nemesis.SisyphusMonkey: Files /var/lib/scylla/data/keyspace1/standard1-ce6a94f04e6c11eebcd13adf5291bd09/me-10059-big-Data.db were destroyed
...
2023-09-09 00:02:12,185 f:remote_base.py l:520 c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
...
2023-09-09 00:02:16,514 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > java.io.IOException: Operation x0 on key(s) [314d4e4e374c314d3031]: Data returned was not validated
...
2023-09-09 00:04:00,154 f:cli.py l:645 c:sdcm.mgmt.cli p:DEBUG > Created task id is: repair/ed93feab-2a38-4a02-a42b-a0f4b1d8d594
```
So, +1 to @fruch here.
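To illustrate the consistency-level point (a hedged sketch, not a change agreed in this thread): with replication_factor=3, a quorum read cannot be served by the corrupted replica alone, so the same main load would be expected to survive this nemesis if it used cl=QUORUM instead of cl=ONE:

```
cassandra-stress read cl=QUORUM duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50 -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..200200300 -log interval=5
```

This is the exact stress command from the failure above with only the cl= parameter changed.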
> - Too late for a name change, it's already a "brand".
>
> Yeah, a bad one. It should be renamed and documented properly in the code, or even in its own markdown file if that's not enough.
+1 here too. We are not selling anything to anyone using SCT; if something doesn't help us, or even does harm, we should change it.
Feel free to suggest a name in the PR that fixes it. I'll keep calling it "Limited" :)
Issue description

If we run a stress command with `cl=ONE`, like in the test-cases/longevity/longevity-encryption-at-rest-200GB-6h.yaml config file, in parallel with the newly added MgmtCorruptThenRepair nemesis (https://github.com/scylladb/scylla-cluster-tests/pull/6531), then we get the following errors in the loader. The CI job continues to run, and only when teardown starts do we get a Critical SCT event:

Nemesis-related config options:

Nemesis attrs:

How frequently does it reproduce?

100%
Installation details

Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): `2023.1.1-20230906.f4633ec973b0` with build-id `b454e7a22f80cf71a33b2f39e47127225e8fbc13`
Cluster size: 4 nodes (i4i.4xlarge)
Scylla Nodes used in this run:
OS / Image: `ami-00ce399fc50357db3` (aws: undefined_region)
Test: vp-EaR-longevity-kms-200gb-6h-test
Test id: ee769899-db49-464e-87c2-98563d8d63b2
Test name: scylla-staging/valerii/vp-EaR-longevity-kms-200gb-6h-test
Test config file(s):

Logs and commands
- Restore Monitor Stack command: `$ hydra investigate show-monitor ee769899-db49-464e-87c2-98563d8d63b2`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=ee769899-db49-464e-87c2-98563d8d63b2)
- Show all stored logs command: `$ hydra investigate show-logs ee769899-db49-464e-87c2-98563d8d63b2`

## Logs:

- **db-cluster-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/db-cluster-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/db-cluster-ee769899.tar.gz)
- **sct-runner-events-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-runner-events-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-runner-events-ee769899.tar.gz)
- **sct-ee769899.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-ee769899.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-ee769899.log.tar.gz)
- **loader-set-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/loader-set-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/loader-set-ee769899.tar.gz)
- **monitor-set-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/monitor-set-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/monitor-set-ee769899.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-EaR-longevity-kms-200gb-6h-test/29/)
[Argus](https://argus.scylladb.com/test/b8ec9629-422a-4c8d-9e30-7949c5f80f21/runs?additionalRuns[]=ee769899-db49-464e-87c2-98563d8d63b2)