scylladb / scylla-manager

The Scylla Manager
https://manager.docs.scylladb.com/stable/
Other
48 stars 33 forks source link

Add explicit deduplicate stage to SM (remove basing on .crc32 or UUID as generation ID) #3827

Open karol-kokoszka opened 2 months ago

karol-kokoszka commented 2 months ago

Scylla Enterprise 2024.1 introduced as a default support for UUIDs in SSTable names. https://github.com/scylladb/scylladb/pull/13932

Scylla manager no longer needs to compare checksums when copying files from local to storage, as the SSTable uniqueness is guaranteed by the UUID and the situation where SSTable have the same name and size, but different content is mitigated.

The above is valid for backups made on >= 2024.1 only. We still need to support old approach for older Scyllas.

Issue on QA side https://github.com/scylladb/qa-tasks/issues/1397

More details of why we have to add this support is here https://github.com/scylladb/scylla-enterprise/issues/4126#issuecomment-2075049181

karol-kokoszka commented 2 months ago

grooming notes

We must check the version of Scylla that we perform the backup task against. If the version uses UUID in the SSTable names, (>= 2024.1) then we must disable checksum comparision during the move operation performed by RClone.

(per @Michal-Leszczynski) We can check at runtime if the SSTable name includes the UUID or not. If it's not, then we must enable checksum, if it is, we must disable.

karol-kokoszka commented 2 weeks ago

Update:

SSTable specification includes .crc32 file which is actually the checksum created by scylla server. It's enough to compare the content of .crc32 inlcuded into created snapshot for every SSTable to the content of its remote equivalent. If they are equal, then just remove the SSTable from snapshot dir.

The solution above doesn't require to have Scylla supporting UUIDs in the SSTable names. It's just a general solution.

SSTable specification https://opensource.docs.scylladb.com/stable/architecture/sstable/sstable3/sstables-3-data-file-format.html

karol-kokoszka commented 1 week ago

We need to test the scenario described in the previous comment against the snapshot that contains hundreds of SSTables to see the performance of this operation. It includes reading content of every crc32 file that is part of the current snapshot.

karol-kokoszka commented 15 hours ago

RFC for the new deduplication stage: https://docs.google.com/document/d/1EtGlF6UGNy34D_7QsnCheaukp3UwVObZU56PBdd0CQ8/edit?usp=sharing