Restore should be able to skip corrupted sstables

nopzdk commented 3 weeks ago

When the restore function encounters malformed sstables, the process stops with the following error. We should be able to skip these errors and record them in the logs. Right now the only alternative is a manual restore.

Oct 23 00:30:40 ip-10-163-152-89.eu-west-2.compute.internal scylla-manager[321652]: {"L":"ERROR","T":"2024-10-23T00:30:40.613Z","N":"restore","M":"Failed to restore files on host","host":"10.163.144.187","error":"restore batch: call load and stream: giving up after 10 attempts: agent [HTTP 500] Failed to load new sstables: sstables::malformed_sstable_exception (CompressionInfo is malformed: zero chunk_len in sstable /var/lib/scylla/data/keyspace/table-d6625af07c2c11ef8e18d74ce9f9e7bf/upload/md-4848-big-CompressionInfo.db)","_trace_id":"iZTtItQxRpqST4OC7YkD3w","errorStack":"github.com/scylladb/scylla-manager/v3/pkg/service/restore.(*tablesDirWorker).restore.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/restore/tablesdir_worker.go:143\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parallel.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:72\nruntime.goexit\n\truntime/asm_amd64.s:1695\n","S":"github.com/scylladb/go-log.Logger.log\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:101\ngithub.com/scylladb/go-log.Logger.Error\n\tgithub.com/scylladb/go-log@v0.0.7/logger.go:84\ngithub.com/scylladb/scylla-manager/v3/pkg/service/restore.(*tablesDirWorker).restore.func2\n\tgithub.com/scylladb/scylla-manager/v3/pkg/service/restore/tablesdir_worker.go:153\ngithub.com/scylladb/scylla-manager/v3/pkg/util/parallel.Run.func1\n\tgithub.com/scylladb/scylla-manager/v3/pkg/util/parallel/parallel.go:79"}
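
To make "skip and record" concrete, here is a minimal Go sketch of how a malformed-SSTable failure could be told apart from other agent errors and its path pulled out for logging. It assumes the error text always follows the shape seen in the log above. The function name `malformedSSTablePath` and the regular expression are hypothetical and are not part of Scylla Manager.

```go
package main

import (
	"fmt"
	"regexp"
)

// malformedRe matches the error text from the log above. Assumption: the agent
// always reports malformed SSTables as
// "sstables::malformed_sstable_exception (<reason> in sstable <path>)".
var malformedRe = regexp.MustCompile(
	`sstables::malformed_sstable_exception \(.*? in sstable (\S+?)\)`)

// malformedSSTablePath (hypothetical helper) returns the path of the SSTable
// named in a malformed_sstable_exception message, or false for any other error.
func malformedSSTablePath(msg string) (string, bool) {
	m := malformedRe.FindStringSubmatch(msg)
	if m == nil {
		return "", false
	}
	return m[1], true
}

func main() {
	msg := "agent [HTTP 500] Failed to load new sstables: " +
		"sstables::malformed_sstable_exception (CompressionInfo is malformed: " +
		"zero chunk_len in sstable /var/lib/scylla/data/keyspace/" +
		"table-d6625af07c2c11ef8e18d74ce9f9e7bf/upload/md-4848-big-CompressionInfo.db)"

	if path, ok := malformedSSTablePath(msg); ok {
		// A skipping restore could log this path and move on to the next batch.
		fmt.Println("skipping malformed sstable:", path)
	}
}
```

Matching on message text is fragile, so a real implementation would likely prefer a structured error code from the agent if one exists.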

Michal-Leszczynski commented 3 weeks ago

Since not all SSTables would be downloaded and load&streamed to the cluster, SM couldn't continue the restore procedure (e.g. run repair, rebuild views, re-enable tombstone_gc), as that could leave the cluster data in an undefined state, with data resurrection or even corruption as possible outcomes.

So besides restoring the malformed SSTable manually, the user would also need to continue the restore procedure manually. We could extend SM so that the user can declare that a given SSTable has been restored manually (SM has no way of verifying this) and order SM to continue the restore procedure from that point.

So this feature would require some planning and probably won't be added in the near future.

karol-kokoszka commented 2 weeks ago

Let's bring this issue to Scylla Manager planning to check if we want to implement such an approach in Scylla Manager or not.

cc: @tzach

scylladb / scylla-manager

Restore should be able to skip corrupted sstables #4093