scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/

Experiment with speeding up the repair #3791

Open Michal-Leszczynski opened 5 months ago

Michal-Leszczynski commented 5 months ago

We are aware that the SM 3.2 repair changes improved stability but also slowed repair down. For some use cases that trade-off is fine, but it becomes painful when a repair runs for many days (see https://github.com/scylladb/scylla-enterprise/issues/4006). Introducing the small table optimization (#3642) will surely help, but it won't solve the entire problem. I see two possible improvements:

### Tasks
- [ ] #3789
- [ ] #3790

Both of them are actually rather small changes on the SM side. The bigger problem could be testing - this would require QA assistance. Neither improvement would be enabled by default - they could be optionally turned on via the proper SM flags or config.

@karol-kokoszka @vladzcloudius @asias I know that you have been discussing similar topics in various issues, but I would like this task list to summarize your opinions. FYI @tzach

A-Posthuman commented 4 months ago

+1, both of these improvements are needed. On our very lightly loaded cluster (typically 8% CPU during repair) of 3 AWS im4gn.4xlarge nodes, for some reason the repair is currently taking upwards of 3 weeks at maximum intensity on the latest SM 3.2.8.