Open Michal-Leszczynski opened 5 months ago
+1 for both of these improvements being needed. On our very lightly loaded cluster of 3 AWS im4gn.4xlarge instances (typically 8% CPU during repair), the repair process is, for some reason, currently taking upwards of 3 weeks at maximum intensity on the latest SM 3.2.8.
We are aware that the SM 3.2 repair changes improved stability but also slowed repair down. For some use cases that trade-off is fine, but it is rough when a repair runs for many days (see https://github.com/scylladb/scylla-enterprise/issues/4006). Introducing the small table optimization (#3642) will surely help, but it won't solve the entire problem. I see two possible improvements:
Both of them are actually rather small changes on the SM side. The bigger problem could be testing, which would require QA assistance. Neither improvement would be enabled by default; they could be turned on optionally through the proper SM flags or config.
@karol-kokoszka @vladzcloudius @asias I know that you have been discussing similar topics on various issues, but I would like this task list to summarize your opinions. FYI @tzach