scylladb / scylla-cluster-tests

Tests for Scylla Clusters

Newly added `MgmtCorruptThenRepair` nemesis makes CI jobs fail which use stress commands with `cl=ONE` #6599

Closed: vponomaryov closed this issue 1 year ago

vponomaryov commented 1 year ago

Issue description

If we run a stress command with cl=ONE, like the one in the test-cases/longevity/longevity-encryption-at-rest-200GB-6h.yaml config file:

stress_read_cmd: ["cassandra-stress read cl=ONE duration=360m \
    -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' \
    -mode cql3 native -rate threads=50  -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..200200300 -log interval=5"]

in parallel with the newly added MgmtCorruptThenRepair nemesis (https://github.com/scylladb/scylla-cluster-tests/pull/6531), then we get the following errors in the loader:

total,     446363976,   27804,   27804,   27804,     1.8,     1.6,     3.4,     5.1,    11.4,    32.5,20335.0,  0.00174,      0,      0,       0,       0,       0,       0
java.io.IOException: Operation x0 on key(s) [344c4e504d4f304b3630]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [38384c37344f36373331]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [364f333036364f303030]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [354d3734354d37353131]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [3833394e353730334d30]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [334e4f363336344d3730]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [3637384e37504d4f4f30]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [4b4e4e50324e34313630]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
java.io.IOException: Operation x0 on key(s) [4f3937394e5033363230]: Data returned was not validated

    at org.apache.cassandra.stress.Operation.error(Operation.java:141)
    at org.apache.cassandra.stress.Operation.timeWithRetry(Operation.java:119)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:101)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:109)
    at org.apache.cassandra.stress.operations.predefined.CqlOperation.run(CqlOperation.java:264)
    at org.apache.cassandra.stress.StressAction$Consumer.run(StressAction.java:473)
total,     446432257,   27126,   27123,   27123,     1.8,     1.6,     3.4,     5.5,    18.7,    30.2,20337.5,  0.00174,      7,      0,       0,       0,       0,       0
FAILURE
java.lang.RuntimeException: Failed to execute stress action
    at org.apache.cassandra.stress.StressAction.run(StressAction.java:101)
    at org.apache.cassandra.stress.Stress.run(Stress.java:143)
    at org.apache.cassandra.stress.Stress.main(Stress.java:62)
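
For context on why these validation errors appear at all: with `replication_factor=3` but `cl=ONE`, each read is satisfied by a single replica, so any read routed to the node whose SSTables the nemesis just destroyed can come back incomplete, which cassandra-stress reports as "Data returned was not validated". A minimal sketch of the difference between the two consistency levels, using the Python driver (illustrative only, not SCT code; the contact point is a placeholder and the key is copied from the log above):

```python
# Minimal sketch (not SCT or cassandra-stress code) of why cl=ONE reads are
# fragile under SSTable corruption. Assumes the default cassandra-stress
# schema keyspace1.standard1; the contact point is a placeholder.
from cassandra import ConsistencyLevel
from cassandra.cluster import Cluster
from cassandra.query import SimpleStatement

cluster = Cluster(["127.0.0.1"])  # placeholder contact point
session = cluster.connect("keyspace1")

# CL=ONE: the coordinator accepts a single replica's answer as-is. If that
# replica is the node whose SSTables were destroyed, the read returns
# missing data and cassandra-stress fails its validation.
read_one = SimpleStatement(
    "SELECT * FROM standard1 WHERE key = %s",
    consistency_level=ConsistencyLevel.ONE,
)
row = session.execute(read_one, (bytes.fromhex("344c4e504d4f304b3630"),)).one()

# CL=QUORUM: with RF=3, two replicas must answer, so a single corrupted
# replica is masked; this is why the same load with quorum reads survives
# the nemesis while cl=ONE does not.
read_quorum = SimpleStatement(
    "SELECT * FROM standard1 WHERE key = %s",
    consistency_level=ConsistencyLevel.QUORUM,
)
```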

The CI job continues to run, and only when teardown starts do we get a critical SCT event:

2023-09-09 00:23:23.419: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=4349e27d-9dbe-4f25-8284-740ce5cb3612 during_nemesis=MgmtCorruptThenRepair duration=6h0m8s: node=Node longevity-encrypt-at-rest-200gb-6h--loader-node-ee769899-1 [54.227.37.60 | 10.12.9.205] (seed: False)
stress_cmd=cassandra-stress read cl=ONE duration=360m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50  -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..200200300 -log interval=5
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
com.datastax.driver.core.exceptions.Transport

Nemesis-related config options:

nemesis_class_name: 'SisyphusMonkey'                                                                              
nemesis_selector: ['limited'] 

Nemesis attrs:

class MgmtCorruptThenRepair(Nemesis):                                                                   
    manager_operation = True                                                                            
    disruptive = True                                                                                   
    kubernetes = True                                                                                   
    limited = True                                                                                      

    def disrupt(self):                                                                                  
        self.disrupt_mgmt_corrupt_then_repair() 
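
As far as selection goes, `SisyphusMonkey` with `nemesis_selector: ['limited']` picks disruptions by matching class attributes like the ones above, so setting `limited = True` on the class is all it takes to pull the nemesis into every "limited" run. A rough sketch of that attribute-based selection idea (illustrative only, not SCT's actual selector implementation):

```python
# Illustrative sketch of attribute-based nemesis selection; not SCT's
# actual selector code, just the general idea behind nemesis_selector.
class Nemesis:
    manager_operation = False
    disruptive = False
    kubernetes = False
    limited = False

class MgmtCorruptThenRepair(Nemesis):
    manager_operation = True
    disruptive = True
    kubernetes = True
    limited = True

def select_nemeses(selector_flags):
    """Return Nemesis subclasses for which every listed flag is True."""
    return [
        cls for cls in Nemesis.__subclasses__()
        if all(getattr(cls, flag, False) for flag in selector_flags)
    ]

# nemesis_selector: ['limited'] would match MgmtCorruptThenRepair here.
print(select_nemeses(["limited"]))
```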

How frequently does it reproduce?

100%

Installation details

Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): 2023.1.1-20230906.f4633ec973b0 with build-id b454e7a22f80cf71a33b2f39e47127225e8fbc13

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-00ce399fc50357db3 (aws: undefined_region)

Test: vp-EaR-longevity-kms-200gb-6h-test
Test id: ee769899-db49-464e-87c2-98563d8d63b2
Test name: scylla-staging/valerii/vp-EaR-longevity-kms-200gb-6h-test
Test config file(s):

Logs and commands:
- Restore Monitor Stack command: `$ hydra investigate show-monitor ee769899-db49-464e-87c2-98563d8d63b2`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=ee769899-db49-464e-87c2-98563d8d63b2)
- Show all stored logs command: `$ hydra investigate show-logs ee769899-db49-464e-87c2-98563d8d63b2`

Logs:
- **db-cluster-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/db-cluster-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/db-cluster-ee769899.tar.gz)
- **sct-runner-events-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-runner-events-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-runner-events-ee769899.tar.gz)
- **sct-ee769899.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-ee769899.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/sct-ee769899.log.tar.gz)
- **loader-set-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/loader-set-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/loader-set-ee769899.tar.gz)
- **monitor-set-ee769899.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/monitor-set-ee769899.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/ee769899-db49-464e-87c2-98563d8d63b2/20230909_003711/monitor-set-ee769899.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-EaR-longevity-kms-200gb-6h-test/29/)
[Argus](https://argus.scylladb.com/test/b8ec9629-422a-4c8d-9e30-7949c5f80f21/runs?additionalRuns[]=ee769899-db49-464e-87c2-98563d8d63b2)
vponomaryov commented 1 year ago

@roydahan, @fruch, @ShlomiBalalis ^

vponomaryov commented 1 year ago

I ran the job with only the problematic nemesis. It looks like the stress command with that cl=ONE config generates error SCT events only during the first attempt of the nemesis, lets the job run until tearDown, and then fails with the critical SCT event:

2023-09-12 18:23:30.494: (CassandraStressEvent Severity.CRITICAL) period_type=end event_id=29ef847c-52bc-4752-87ad-743d4da92727 during_nemesis=MgmtCorruptThenRepair duration=3h0m9s: node=Node longevity-encrypt-at-rest-200gb-6h--loader-node-f22364f2-1 [52.73.213.1 | 10.12.10.227] (seed: False)
stress_cmd=cassandra-stress read cl=ONE duration=180m -schema 'replication(strategy=NetworkTopologyStrategy,replication_factor=3) compaction(strategy=SizeTieredCompactionStrategy)' -mode cql3 native -rate threads=50  -col 'size=FIXED(200) n=FIXED(5)' -pop seq=1..200200300 -log interval=5
errors:
Stress command completed with bad status 1: Failed to connect over JMX; not collecting these stats
com.datastax.driver.core.exceptions.Transport

Installation details

Kernel Version: 5.15.0-1044-aws
Scylla version (or git commit hash): 2023.1.1-20230906.f4633ec973b0 with build-id b454e7a22f80cf71a33b2f39e47127225e8fbc13

Cluster size: 4 nodes (i4i.4xlarge)

Scylla Nodes used in this run:

OS / Image: ami-00ce399fc50357db3 (aws: undefined_region)

Test: vp-EaR-longevity-kms-200gb-6h-test
Test id: f22364f2-0e84-4e84-84ee-d3c3b89566a9
Test name: scylla-staging/valerii/vp-EaR-longevity-kms-200gb-6h-test
Test config file(s):

Logs and commands:
- Restore Monitor Stack command: `$ hydra investigate show-monitor f22364f2-0e84-4e84-84ee-d3c3b89566a9`
- Restore monitor on AWS instance using [Jenkins job](https://jenkins.scylladb.com/view/QA/job/QA-tools/job/hydra-show-monitor/parambuild/?test_id=f22364f2-0e84-4e84-84ee-d3c3b89566a9)
- Show all stored logs command: `$ hydra investigate show-logs f22364f2-0e84-4e84-84ee-d3c3b89566a9`

Logs:
- **db-cluster-f22364f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/db-cluster-f22364f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/db-cluster-f22364f2.tar.gz)
- **sct-runner-events-f22364f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/sct-runner-events-f22364f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/sct-runner-events-f22364f2.tar.gz)
- **sct-f22364f2.log.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/sct-f22364f2.log.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/sct-f22364f2.log.tar.gz)
- **loader-set-f22364f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/loader-set-f22364f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/loader-set-f22364f2.tar.gz)
- **monitor-set-f22364f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/monitor-set-f22364f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/monitor-set-f22364f2.tar.gz)
- **parallel-timelines-report-f22364f2.tar.gz** - [https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/parallel-timelines-report-f22364f2.tar.gz](https://cloudius-jenkins-test.s3.amazonaws.com/f22364f2-0e84-4e84-84ee-d3c3b89566a9/20230912_183737/parallel-timelines-report-f22364f2.tar.gz)

[Jenkins job URL](https://jenkins.scylladb.com/job/scylla-staging/job/valerii/job/vp-EaR-longevity-kms-200gb-6h-test/30/)
[Argus](https://argus.scylladb.com/test/b8ec9629-422a-4c8d-9e30-7949c5f80f21/runs?additionalRuns[]=f22364f2-0e84-4e84-84ee-d3c3b89566a9)
fruch commented 1 year ago

I'm not sure why the encryption-at-rest case uses cl=ONE for its main load; that's highly unusual for a main load, and one would need a very, very good reason to run SCT with a nemesis like that. (Maybe it was for validation before the nemesis was introduced, and was forgotten.)

fruch commented 1 year ago

And based on @roydahan's comment, I think MgmtCorruptThenRepair was wrongly introduced into the limited group:

class MgmtCorruptThenRepair(Nemesis):
    manager_operation = True
    disruptive = True
    kubernetes = True
    limited = True

    def disrupt(self):
        self.disrupt_mgmt_corrupt_then_repair()

@roydahan, we should rename that group and document it clearly.

This says nothing of importance to anyone:

limited: bool = False           # flag that signal that nemesis are belong to limited set of nemesises
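
One possible shape of a clearer declaration, to make the suggestion concrete (the name and the comment below are illustrative suggestions only, not the current SCT attribute; the actual membership criteria would need to be written down by the team):

```python
class Nemesis:
    # Illustrative suggestion only, not the current SCT attribute: a
    # self-documenting replacement for `limited`. True if this nemesis
    # belongs to the curated rotation that most longevity jobs run by
    # default; the exact membership criteria should be documented here
    # (or in a dedicated markdown file, as proposed above).
    core_rotation: bool = False
```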
ShlomiBalalis commented 1 year ago

> And based on @roydahan's comment, I think MgmtCorruptThenRepair was wrongly introduced into the limited group:
>
>     class MgmtCorruptThenRepair(Nemesis):
>         manager_operation = True
>         disruptive = True
>         kubernetes = True
>         limited = True
>
>         def disrupt(self):
>             self.disrupt_mgmt_corrupt_then_repair()
>
> @roydahan, we should rename that group and document it clearly.
>
> This says nothing of importance to anyone:
>
>     limited: bool = False           # flag that signal that nemesis are belong to limited set of nemesises

I definitely agree. It's a meaningless name.

roydahan commented 1 year ago
1. If CorruptThenRepair isn't "limited", the same nemesis with Manager cannot be "limited" either.
2. Too late for a name change; it's already a "brand".
fruch commented 1 year ago
> 1. If CorruptThenRepair isn't "limited", the same nemesis with Manager cannot be "limited" either.

I think the labels were copied from another manager nemesis, not from CorruptThenRepair.

CorruptThenRepair is not in the limited group.

> 2. Too late for a name change; it's already a "brand".

Yeah, a bad one. It should be renamed and documented properly in the code, or even in its own markdown file if that's not enough.

vponomaryov commented 1 year ago
> > 1. If CorruptThenRepair isn't "limited", the same nemesis with Manager cannot be "limited" either.
>
> I think the labels were copied from another manager nemesis, not from CorruptThenRepair.
>
> CorruptThenRepair is not in the limited group.

The failure happens before Scylla Manager gets involved at all; it happens right after the scylla service is started:

2023-09-08 17:24:56,179 f:tester.py       l:626  c:LongevityTest        p:INFO  > Test start time 2023-09-08 17:24:56, duration is 550 and timeout set to 2023-09-09 02:34:56
...
NEMESIS
disrupt_mgmt_corrupt_then_repair    longevity-encrypt-at-rest-200gb-6h--db-node-ee769899-2  Skipped 2023-09-09 00:00:44 2023-09-09 00:24:35
...
2023-09-09 00:01:36,592 f:nemesis.py      l:1053 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: SStables amount to destroy (50 percent of all SStables): 79
...
2023-09-09 00:01:37,814 f:nemesis.py      l:1068 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Files /var/lib/scylla/data/keyspace1/standard1-ce6a94f04e6c11eebcd13adf5291bd09/me-9512-big-Data.db were destroyed
...
2023-09-09 00:02:12,185 f:nemesis.py      l:1068 c:sdcm.nemesis         p:DEBUG > sdcm.nemesis.SisyphusMonkey: Files /var/lib/scylla/data/keyspace1/standard1-ce6a94f04e6c11eebcd13adf5291bd09/me-10059-big-Data.db were destroyed
...
2023-09-09 00:02:12,185 f:remote_base.py  l:520  c:RemoteLibSSH2CmdRunner p:DEBUG > Running command "sudo systemctl start scylla-server.service"...
...
2023-09-09 00:02:16,514 f:base.py         l:228  c:RemoteLibSSH2CmdRunner p:DEBUG > java.io.IOException: Operation x0 on key(s) [314d4e4e374c314d3031]: Data returned was not validated
...
2023-09-09 00:04:00,154 f:cli.py          l:645  c:sdcm.mgmt.cli        p:DEBUG > Created task id is: repair/ed93feab-2a38-4a02-a42b-a0f4b1d8d594
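
To spell out the sequence this log shows: the nemesis destroys about half of the SSTable Data files, starts scylla-server again, and only about two minutes later creates the Manager repair task, so the cl=ONE validation failures land in the window between the restart and the repair. A rough Python outline of that flow (the helper names and the explicit stop step are assumptions; this is not SCT's implementation):

```python
import random
import subprocess

def corrupt_then_repair_outline(ssh_prefix, sstable_paths, create_repair_task):
    """Illustrative outline of the logged MgmtCorruptThenRepair flow.

    ssh_prefix, sstable_paths, and create_repair_task are hypothetical
    stand-ins for SCT's remoting and Scylla Manager CLI helpers.
    """
    # Stop the node first (assumed; the log above only shows the start).
    subprocess.run([*ssh_prefix, "sudo", "systemctl", "stop",
                    "scylla-server.service"], check=True)

    # Destroy ~50% of the SSTables, matching "SStables amount to destroy
    # (50 percent of all SStables)" in the nemesis log.
    for path in random.sample(sstable_paths, len(sstable_paths) // 2):
        subprocess.run([*ssh_prefix, "sudo", "rm", "-f", path], check=True)

    # Start scylla-server again, matching the logged command; from here
    # until the repair finishes, cl=ONE reads can hit the corrupted replica.
    subprocess.run([*ssh_prefix, "sudo", "systemctl", "start",
                    "scylla-server.service"], check=True)

    # Schedule the Manager repair ("Created task id is: repair/..." above).
    return create_repair_task()
```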

So, +1 to @fruch here.

> > 2. Too late for a name change; it's already a "brand".
>
> Yeah, a bad one. It should be renamed and documented properly in the code, or even in its own markdown file if that's not enough.

+1 here too. We are not selling anything to anyone who uses SCT. If something doesn't help us, and even does harm, then we should change it.

roydahan commented 1 year ago

Feel free to suggest a name in the PR that fixes it. I'll keep calling it "Limited" :)