karol-kokoszka opened 1 month ago
Just some observations:
The current sanity test partially covers repair functionality. I've executed it against Scylla 6.0 - it failed with:
AssertionError: The keyspace system_auth was not included in the repair!
SCT already has a couple of tests for repair (though there were no Jenkins jobs for them):
2.1 _test_repaircontrol (9 nodes cluster, singleDC + 4 loaders):
2.2 test_repair_intensity_feature_on_single_node (6 nodes cluster, singleDC + 4 loaders):
2.3 _test_repair_intensity_feature_on_multiplenode (9 nodes cluster, singleDC + 4 loaders):
@karol-kokoszka @Michal-Leszczynski Please take a look at the described tests. I think they cover your request.
The only open question is the C-S read duration. I did a test run of test_repair_control (2.1): it took ~7 minutes to execute the repair test, and for the rest of the time (~400 − 7 minutes) the test just waited for the C-S read operation to finish. I'm not really sure we need to run such a long C-S read for repair verification. Every test needs 10+ EC2 instances and costs us a lot. @ShlomiBalalis Could you please assist with that as the author of these tests?
AssertionError: The keyspace system_auth was not included in the repair!
This is expected. In Scylla 6.0 there is no `system_auth` keyspace anymore (unless it is a leftover after an upgrade). Auth info is stored either in `system_auth_v2` (a new keyspace) or as part of the `system` keyspace (I don't remember which option was chosen). Since auth is managed by Raft in Scylla 6.0, there is no need to repair it.
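A version-aware expectation for the sanity test's keyspace assertion could look roughly like the sketch below. The helper name is hypothetical, not real SCT code; only the `system_auth` behavior on Scylla 6.0 is taken from the explanation above.

```python
# Hedged sketch: expected_repair_keyspaces is a hypothetical helper, not SCT API.
def expected_repair_keyspaces(scylla_version: str, keyspaces: set) -> set:
    """On Scylla >= 6.0 auth is raft-managed, so system_auth (if present at
    all, e.g. as an upgrade leftover) should not be asserted as repaired."""
    major = int(scylla_version.split(".")[0])
    if major >= 6:
        return keyspaces - {"system_auth"}
    return keyspaces
```

With this, the assertion that failed against Scylla 6.0 would simply stop expecting `system_auth` while keeping the check intact for older versions.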
@mikliapko what is the RF for those tests? Also, float intensities have been deprecated for some time; any float intensity in (0, 1) is rounded up to 1, so it doesn't make much sense to test them.
I'm not really sure we need to run such a long C-S read for repair verification. Every test needs 10+ EC2 instances and costs us a lot.
Are those C-S reads aimed at checking repair correctness or at simulating traffic (I would think the latter, as we are checking different intensity settings here)? It's generally a good idea to test repair on a cluster serving traffic, as it is a more realistic scenario, but C-S could just be canceled after the repair has finished. Is there a possibility to do that from within the test?
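The "cancel C-S after repair" idea could be sketched like this. The `cs_cmd` and `wait_for_repair_done` arguments are placeholders for whatever the test actually uses; this is not SCT API, just an illustration of the shape.

```python
# Hedged sketch: stop cassandra-stress as soon as the repair task completes,
# instead of waiting out the full configured stress duration (~400 min).
import subprocess

def run_read_load_during_repair(cs_cmd, wait_for_repair_done):
    cs = subprocess.Popen(cs_cmd)      # background cassandra-stress read load
    try:
        wait_for_repair_done()         # block only until the repair task completes
    finally:
        cs.terminate()                 # stop C-S right after repair finishes
        cs.wait(timeout=60)
    return cs.returncode
```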
@mikliapko what is the RF for those tests?
RF is equal to 3 for all tests.
Also, float intensities have been deprecated for some time; any float intensity in (0, 1) is rounded up to 1, so it doesn't make much sense to test them.
Okay, I'll remove these intensities from the tests in that case.
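The deprecation rule quoted above can be made concrete with a tiny sketch. The function name is illustrative; Scylla Manager applies this normalization itself, so fractional parametrizations add no coverage.

```python
# Hedged sketch of the stated rule: any float intensity in (0, 1) rounds up to 1.
def effective_intensity(intensity: float) -> int:
    if 0 < intensity < 1:
        return 1                      # deprecated fractional values collapse to 1
    return int(intensity)

# e.g. a parametrization like [0.5, 1, 2] effectively tests [1, 1, 2]
```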
Are those C-S reads aimed at checking repair correctness or at simulating traffic (I would think the latter, as we are checking different intensity settings here)? It's generally a good idea to test repair on a cluster serving traffic, as it is a more realistic scenario, but C-S could just be canceled after the repair has finished. Is there a possibility to do that from within the test?
I suppose both points may be relevant here. @ShlomiBalalis Could you please clarify this?
Added the grooming-needed label to discuss possible changes in the test configuration.
Refinement notes
We already have an SCT test available; it's included in the Jenkins folder for the Scylla Manager CI: https://jenkins.scylladb.com/view/scylla-manager/job/manager-master/job/repair-control-test/
There is no need to execute a cassandra-stress read of the data for 400 minutes during the test, as the repair lasts much shorter than that. This means the SCT run takes a long time even though the assertion (the repair itself) completes much faster.
From @Michal-Leszczynski:
"
It makes no sense to use such strong machines as i4i.2xlarge.
Run the test on e.g. a 6-node cluster on weaker machines. Metrics can be used to check what the load is during the test execution; we expect to keep the load at a reasonably high level.
The expectation is to have reasonably high traffic plus a reasonably high amount of data to repair.
There are two factors: the machines used by Scylla and the amount of data. It's easier to just use weaker machines than to put TBs of data into the cluster.
"
Metric for load, taken from the monitoring stack: `avg(scylla_reactor_utilization{instance="", cluster="", dc="", shard=""})`
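That load metric can be fetched programmatically from the monitoring stack's Prometheus via an instant query. A minimal sketch, assuming Prometheus is reachable on the usual port 9090 (the host name is a placeholder):

```python
# Hedged sketch: run the reactor-utilization query above against Prometheus's
# instant-query endpoint and parse the scalar out of the standard response.
import json
import urllib.parse
import urllib.request

PROM_QUERY = 'avg(scylla_reactor_utilization)'  # add label filters as needed

def parse_instant_value(body: str) -> float:
    """Prometheus instant queries return result[0]["value"] = [ts, "<float>"]."""
    data = json.loads(body)
    return float(data["data"]["result"][0]["value"][1])

def avg_reactor_utilization(prom_url: str = "http://monitor-node:9090") -> float:
    url = f"{prom_url}/api/v1/query?" + urllib.parse.urlencode({"query": PROM_QUERY})
    with urllib.request.urlopen(url) as resp:
        return parse_instant_value(resp.read().decode())
```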
This is expected. In Scylla 6.0 there is no `system_auth` keyspace anymore (unless it is a leftover after an upgrade). Auth info is stored either in `system_auth_v2` (a new keyspace) or as part of the `system` keyspace (I don't remember which option was chosen). Since auth is managed by Raft in Scylla 6.0, there is no need to repair it.
@Michal-Leszczynski Could you please post a link to the ticket or documentation where this change is described, if you have one? I want to link it in the PR with the test adjustments.
@mikliapko Is this task still in progress? We used SCT to test repair for tablets. Do you plan to make some other changes here as well? This issue is part of the "manager-3.3.0" milestone, and I wonder if we could close it.
@mikliapko Is this task still in progress? We used SCT to test repair for tablets. Do you plan to make some other changes here as well? This issue is part of the "manager-3.3.0" milestone, and I wonder if we could close it.
Yep, we have already used these tests, but the runs were made from our forks, which contained some adjustments regarding tablets and the load command duration.
There is also a suggestion from Michal to use cheaper instances, which should still be tested.
Moreover, I hope to hear something from @ShlomiBalalis about the questions raised above.
So, overall, I'd prefer to keep it open, as there is still some work to be done here.
Pipeline for Scylla Manager that will execute repair against a cluster with X GB of data.
Idea:
This is a prerequisite for validating the correctness of https://github.com/scylladb/scylla-manager/issues/3792, specifically this comment and the follow-ups below it: https://github.com/scylladb/scylla-manager/issues/3792#issuecomment-2115440255