scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Repair: use ranges parallelism #3866

Closed Michal-Leszczynski closed 2 weeks ago

Michal-Leszczynski commented 1 month ago

This PR introduces token range batching in repair jobs. Previously, with intensity=X, SM would simply send X token ranges per job. As described in #3789, this approach could leave the cluster underutilized. Batching also solves #3792, as it increases the odds of a repair job containing a similar number of token ranges owned by each shard.
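To make the difference concrete, here is a minimal, hypothetical Go sketch of the idea: instead of each job carrying exactly `intensity` ranges, ranges are grouped into larger batches (here, for illustration, `intensity × shard count`), so a single job is likely to contain work for every shard. The names and the batch-size formula are illustrative, not SM's actual implementation.

```go
package main

import "fmt"

// TokenRange is a single slice of the token ring (illustrative type).
type TokenRange struct {
	Start, End int64
}

// batchRanges splits ranges into consecutive batches of at most batchSize.
func batchRanges(ranges []TokenRange, batchSize int) [][]TokenRange {
	var batches [][]TokenRange
	for len(ranges) > 0 {
		n := batchSize
		if n > len(ranges) {
			n = len(ranges)
		}
		batches = append(batches, ranges[:n])
		ranges = ranges[n:]
	}
	return batches
}

func main() {
	// 100 ranges owned by a replica set, intensity=2, 8 shards per node.
	ranges := make([]TokenRange, 100)
	intensity, shards := 2, 8

	// Old behaviour: every repair job gets exactly `intensity` ranges.
	oldJobs := batchRanges(ranges, intensity)

	// Batched behaviour (illustrative sizing): a job gets intensity*shards
	// ranges, so each shard is likely to own a similar share of the job.
	newJobs := batchRanges(ranges, intensity*shards)

	fmt.Printf("jobs without batching: %d, with batching: %d\n", len(oldJobs), len(newJobs))
}
```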

Fixes #3789 Fixes #3792

Michal-Leszczynski commented 1 month ago

@karol-kokoszka This PR is ready for review!

karol-kokoszka commented 1 month ago

Let's execute the repair SCT to see the impact: https://github.com/scylladb/scylla-manager/issues/3867

Michal-Leszczynski commented 2 weeks ago

SCT setup:

  • 5 i4i.2xlarge node cluster
  • inserted a total of 292968720 rows for vnode scenarios and twice as many for tablet scenarios
  • rows were inserted while one of the nodes was down
  • stress_command_template:
        " replication(strategy=NetworkTopologyStrategy,replication_factor=3)'" \
        " -col 'size=FIXED(1024) n=FIXED(1)' -pop seq={}..{} -mode cql3" \
        " native -rate threads=200 -log interval=5"

SCT results:

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/michal-leszczynski/job/repair-intensity-multiple-repaired-nodes-test/23/
master & tablets: 18:57:27 -> 19:34:25 ~ 36 min 58 sec

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/michal-leszczynski/job/repair-intensity-multiple-repaired-nodes-test/22/
master & vnode: 04:28:19 -> 04:36:35 ~ 8 min 16 sec

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/michal-leszczynski/job/repair-intensity-multiple-repaired-nodes-test/24/
batching & tablets: 00:43:36 -> 01:02:21 ~ 18 min 45 sec

https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/michal-leszczynski/job/repair-intensity-multiple-repaired-nodes-test/25/
batching & vnode: 02:45:18 -> 02:53:07 ~ 7 min 49 sec

The problem here is that I could only set the number of rows, not the amount of data, inserted into the cluster. If I understand it correctly, the load was around 280GB for vnode and around 560GB for tablets, but someone would need to verify my C-S math.
I first ran the tablet scenarios and observed that the SCT runs take about 4h, which is why I decided to decrease the load so that they could finish faster.
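As a rough sanity check of that C-S math, assuming one 1 KiB column per row (from `-col 'size=FIXED(1024) n=FIXED(1)'`) and ignoring key, metadata, and RF=3 replication overhead:

292968720 rows × 1024 B ≈ 300 * 10^9 B ≈ 279 GiB for the vnode scenarios, and twice that (≈ 558 GiB) for the tablet scenarios, which is in line with the ~280GB and ~560GB figures above.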

@karol-kokoszka is this enough to merge this PR?

karol-kokoszka commented 2 weeks ago

@karol-kokoszka is this enough to merge this PR?

Yes, it is. It's a bit strange that repair for tablets takes longer than for vnodes, though.

Please merge. Thanks!

mykaul commented 2 weeks ago

@asias - thoughts on the above?

Michal-Leszczynski commented 2 weeks ago

inserted a total of 292968720 rows for vnode scenarios and twice as many for tablet scenarios

Also, this setup initially resulted in creating only 64 tablets. This is way too few for testing batching (too few token ranges owned by each replica set), so I created the tablet keyspace with 8192 initial tablets.
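For reference, a minimal sketch of creating such a keyspace from Go with gocql. The keyspace name, contact point, and replication settings are illustrative, and it assumes a ScyllaDB version that accepts the `tablets = {'initial': N}` keyspace option:

```go
package main

import (
	"log"

	"github.com/gocql/gocql"
)

func main() {
	// Illustrative connection settings, not the actual SCT configuration.
	cluster := gocql.NewCluster("127.0.0.1")
	session, err := cluster.CreateSession()
	if err != nil {
		log.Fatal(err)
	}
	defer session.Close()

	// 8192 initial tablets give each replica set many token ranges to own,
	// which is what the range-batching code path needs to be exercised.
	const stmt = `CREATE KEYSPACE IF NOT EXISTS batching_test
		WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 3}
		AND tablets = {'initial': 8192}`
	if err := session.Query(stmt).Exec(); err != nil {
		log.Fatal(err)
	}
}
```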