[SCT] Pipeline to test repair

karol-kokoszka commented 1 month ago

Pipeline for Scylla Manager that will execute the repair against X Gb amount of data cluster.

Idea:

upload X amount of data to the cluster with the assumption that when uploading some amount of data, few nodes are offline. Oflline nodes and amount of data should be mixed.

This is the prerequisite to validate correctness of https://github.com/scylladb/scylla-manager/issues/3792 Specifically this comment and the follow ups below https://github.com/scylladb/scylla-manager/issues/3792#issuecomment-2115440255

mikliapko commented 1 month ago

Just some observations:

The current sanity test partially covers repair functionality. I've executed it against Scylla 6.0 - it failed with:
```
AssertionError: The keyspace system_auth was not included in the repair!
```
https://jenkins.scylladb.com/job/scylla-staging/job/mikita/job/sct/job/manager-master/job/ubuntu22-sanity-test-tablets/4/console
SCT already have a couple of tests for repair (there were no Jenkins jobs for them):

2.1 _test_repaircontrol (9 nodes cluster, singleDC + 4 loaders):

C-S write, ~300Gb;
do flush for every node;
wait for compactions to end;
start C-S read load in background (400 minutes);
test repair: -- one by one *stop scylla-server on one of the nodes and C-S write ~35Gb; -- alter table (WITH read_repair_chance = 0.0); -- do flush for every node; -- wait for compactions to end; -- do repair by 20% block, each block with different parameters: --- default; --- {"intensity": .0001}; --- {"intensity": 0}; --- {"parallel": 1}; --- {"intensity": 2, "parallel": 1}; -- wait for repair status DONE;
wait for C-S read finishing.

2.2 test_repair_intensity_feature_on_single_node (6 nodes cluster, singleDC + 4 loaders):

C-S write, ~200Gb;
do flush for every node;
wait for compactions to end;
start C-S read load in background (1500 minutes);
test repair: -- take one of the nodes -> stop scylla server -> delete keyspace directory (/var/lib/scylla/data/keyspace_name) -> start scylla server; -- create repair task (no intensity) and wait for DONE status; -- repeat the same repair test for: --- intensities: .5, .25, .0001, 2, 4, 100, 0 --- parallel: 1, 2 --- mixed: {"intensity": 2, "parallel": 1}
wait for C-S read finishing.

2.3 _test_repair_intensity_feature_on_multiplenode (9 nodes cluster, singleDC + 4 loaders):

C-S write, ~200Gb;
do flush for every node;
wait for compactions to end;
start C-S read load in background (1020 minutes);
test repair: -- one by one stop scylla-server on one of the nodes and C-S write ~30Gb; -- alter table (WITH read_repair_chance = 0.0); -- do flush for every node; -- wait for compactions to end; -- create repair task (no intensity) and wait for DONE status; -- repeat the same repair test for: --- intensities: .5, .25, .0001, 2, 4, 100, 0 --- parallel: 1, 2 --- mixed: {"intensity": 2, "parallel": 1}
wait for C-S read finishing.

@karol-kokoszka @Michal-Leszczynski Please, take a look into described tests. I think they cover your request.

The only question, I'm not sure about C-S read duration. I did a test run for test_repair_control (2.1). It took ~7 minutes to execute repair test, the rest of the time ~(400 - 7) the test just waited for C-S read operation finishing. I'm not really sure we need to run such a long C-S read for repair verification. Every test needs 10+ EC2 instances and costs us a lot. @ShlomiBalalis Could you please assist with that as the author of these tests?

Michal-Leszczynski commented 1 month ago

AssertionError: The keyspace system_auth was not included in the repair!

This is expected. In Scylla 6.0 there is no system_auth keyspace anymore (unless it is a leftover after upgrade). Auth info is stored in either system_auth_v2 (new keyspace) or as a part of system keyspace (don't remember which option was chosen). Since auth is managed by raft in Scylla 6.0, there is no need for repairing it.

Michal-Leszczynski commented 1 month ago

@mikliapko what is the RF for those tests? Also, float intensities have been deprecated for some time, any float intensity in (0, 1) is rounded up to 1, so it doesn't make much sense to test them.

Michal-Leszczynski commented 1 month ago

I'm not really sure we need to run such a long C-S read for repair verification. Every test needs 10+ EC2 instances and costs us a lot.

Are those C-S reads aimed at checking repair correctness or to simulate traffic (I would think that the latter, as we are checking different intensity settings here)? It's generally good idea to test repair on a cluster serving traffic as it is more realistic scenario, but C-S could just be canceled after repair has finished. Is there a possibility to do it from the scope of the test?

mikliapko commented 4 weeks ago

@mikliapko what is the RF for those tests?

RF is equal to 3 for all tests.

Also, float intensities have been deprecated for some time, any float intensity in (0, 1) is rounded up to 1, so it doesn't make much sense to test them.

Okay, I'll remove these intensities from tests in such case.

mikliapko commented 4 weeks ago

Are those C-S reads aimed at checking repair correctness or to simulate traffic (I would think that the latter, as we are checking different intensity settings here)? It's generally good idea to test repair on a cluster serving traffic as it is more realistic scenario, but C-S could just be canceled after repair has finished. Is there a possibility to do it from the scope of the test?

I suppose the both points may be relevant here. @ShlomiBalalis Could you please clarify this moment?

mikliapko commented 3 weeks ago

Added grooming-needed label to discuss possible changes in test configuration

karol-kokoszka commented 3 weeks ago

refirement notes

We have SCT test available already, it's included into jenkins folder for scylla manager CI https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/repair-control-test/

There is no need to execute cassandra-stress reading the data for 400 minutes during the test, as the repair last much shorter than that, what means the SCT last for long time, but the assertion (repair itself) is completed much faster.

From @Michal-Leszczynski: " It makes no sense to use such a strong machines like i4i.2xlarge. Run the test on e.g. 6 nodes cluster and on weaker machines. Metrics can be used to check what is the load during the test execution and we expect to keep to load on reasonably high level. Expectation is to have the reasonably high traffic + reasonably high amount of data to repair. There are two factors, machines that are used by scylla and amount of data. It's easier to just use weaker machines than putting TBs of data into the cluster. "

Metric for load taken from monitoring stack -> avg(scylla_reactor_utilization{instance="",cluster="", dc="", shard=""} )

mikliapko commented 3 weeks ago

This is expected. In Scylla 6.0 there is no system_auth keyspace anymore (unless it is a leftover after upgrade). Auth info is stored in either system_auth_v2 (new keyspace) or as a part of system keyspace (don't remember which option was chosen). Since auth is managed by raft in Scylla 6.0, there is no need for repairing it.

@Michal-Leszczynski Could you please post the link to the ticket or documentation where this change is described if you have such? Wanna link it to the PR with test adjustments.

karol-kokoszka commented 5 days ago

@mikliapko Is this task still in progress ? We used the SCT to test the repair for tablets. Do you plan to make some other changes here as well ? This issue is part of milestone "manager-3.3.0" and I wonder if we could close it.

mikliapko commented 5 days ago

@mikliapko Is this task still in progress ? We used the SCT to test the repair for tablets. Do you plan to make some other changes here as well ? This issue is part of milestone "manager-3.3.0" and I wonder if we could close it.

Yep, we used these tests already but runs were made from our forks that contained some adjustments about tablets and load cmd duration.

There is also a suggestion from Michal to use cheaper instances which should be tested.

Moreover, I hope to hear something from @ShlomiBalalis about the questions raised above.

So, overall, I'd prefer to keep it open as there is some work that still should be done here.

scylladb / scylla-manager

[SCT] Pipeline to test repair #3867