Use sstable identifier for deduplication instead of sstable generation ID

Michal-Leszczynski commented 1 month ago

Recently, Scylla merged https://github.com/scylladb/scylladb/pull/21002. We should use the sstable identifier it introduces for deduplication instead of the currently used generation ID approach, as this has the following benefits:

it is resilient to sstable migration - the sstable identifier stays the same after sstable migration (not the case for generation ID)
it is safer than deduplicating sstables with int-based generation IDs by their name/size/.crc32

The second argument is self-explanatory. As for the first one, we would need to create a design doc specifying how deduplication/upload would handle the case when an sstable is already present in the backup location, but with a different ID and under a different node path.
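
To illustrate the first benefit, here is a minimal sketch (not Scylla Manager code) comparing the two kinds of deduplication key. The `SSTable` record, its field names, the sample values, and both key functions are hypothetical; the sketch assumes the identifier is read from the sstable's metadata and that a migrated sstable gets a new generation ID on the target node.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SSTable:
    node_id: str      # node currently holding the sstable
    generation: str   # generation ID taken from the file name
    identifier: str   # hypothetical: sstable identifier from its metadata
    size: int
    crc32: int


def generation_key(s: SSTable) -> tuple:
    # Key tied to the node path and the generation in the file name.
    return (s.node_id, s.generation, s.size, s.crc32)


def identifier_key(s: SSTable) -> str:
    # Key that follows the sstable's contents across nodes.
    return s.identifier


# The same data before and after a migration (illustrative values).
before = SSTable("node-a", "12", "id-0001", 1024, 0xDEADBEEF)
after = SSTable("node-b", "7", "id-0001", 1024, 0xDEADBEEF)

backed_up_by_generation = {generation_key(before)}
backed_up_by_identifier = {identifier_key(before)}

# Generation-based lookup misses the migrated sstable and re-uploads it.
needs_upload_generation = generation_key(after) not in backed_up_by_generation
# Identifier-based lookup recognises it as already backed up.
needs_upload_identifier = identifier_key(after) not in backed_up_by_identifier
```

An identifier-keyed lookup like this has to span every node path in the backup location, which is exactly the open question for the design doc.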

Michal-Leszczynski commented 1 month ago

cc: @karol-kokoszka @regevran @bhalevy

Right now I'm not sure how to implement this. The SM backup structure keeps different nodes' sstables under different paths. Also, what happens if we use EAR and different DCs use different encryption keys? Consider the following scenario:

sstable was backed up in dc1
sstable migrated to dc2
we want to back up again

Does this sstable have the same sstable identifier? If so, we should still back it up, since the user might want to restore data from only a single DC, or they might have lost their encryption key from dc1, and so on.

regevran commented 1 month ago

> SM backup structure keeps different node's sstables under different paths.

Does it mean that in order to look for duplicates we need to search all paths?

> different encryption key

Encryption may also be changed on the very same node the backup was taken from (between the backup and the restore). Therefore we decrypt/encrypt all the time. For example, when you back up a file: read encrypted from disk --> decrypt --> network encrypt (node) --> network decrypt (S3) --> encrypt before store --> store

bhalevy commented 1 month ago

> cc: @karol-kokoszka @regevran @bhalevy

> Right now I'm not sure how to implement this. SM backup structure keeps different node's sstables under different paths. Also, what happens if we use EAR and different DCs use different encryption key? Consider the following scenario:

> sstable was backed up in dc1

> sstable migrated to dc2

This should never happen. Typically (when rf==num_racks) tablets are migrated only within racks. More rarely, tablets may be migrated across racks, but they are never migrated across DCs. The only way they could get there is perhaps a restore across DCs.

> we want to back up again

> Does this sstable have the same sstable identifier? If so, we should still back it up, as for some reason user might want to restore data from only a single dc, or they lost their encryption key from dc1, and so on.

bhalevy commented 1 month ago

As for encryption keys, we care about the sstable's contents and not its representation. In the future I suggest we even store it unencrypted on object storage, since S3 has EaR of its own, and then re-encrypt it on restore.
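
A toy sketch of the contents-versus-representation point: the same plaintext stored under two different keys yields different bytes on disk, while a digest computed over the decrypted contents stays the same. The XOR "cipher" is only a stand-in and not real cryptography, and every name here is hypothetical.

```python
import hashlib
from itertools import cycle


def toy_encrypt(data: bytes, key: bytes) -> bytes:
    # XOR stand-in for encryption at rest; NOT real cryptography.
    return bytes(b ^ k for b, k in zip(data, cycle(key)))


# XOR with the same key is its own inverse.
toy_decrypt = toy_encrypt


def content_digest(stored: bytes, key: bytes) -> str:
    # Digest of the decrypted contents, independent of the at-rest key.
    return hashlib.sha256(toy_decrypt(stored, key)).hexdigest()


plaintext = b"sstable contents"
on_disk_dc1 = toy_encrypt(plaintext, b"key-dc1")
on_disk_dc2 = toy_encrypt(plaintext, b"key-dc2")

same_representation = on_disk_dc1 == on_disk_dc2
same_contents = (
    content_digest(on_disk_dc1, b"key-dc1")
    == content_digest(on_disk_dc2, b"key-dc2")
)
```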

mykaul commented 1 month ago

This feature is only going to be available in 2025.1 (Enterprise) / 6.3 (OSS), so I'm not sure we should use it until it's available and widely used.

regevran commented 1 month ago

> This feature is only going to be available in 2025.1 (Enterprise) / 6.3 (OSS), so I'm not sure we should use it until it's available and widely used.

Do you mean that we'll have a long transition time? I guess we need to support deduplication in SM for all sstable variants: numeric generation IDs, UUIDs without an sstable identifier, and sstables with a unique identifier.
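
A rough sketch of how the three variants could be told apart during the transition. The function name, the returned labels, and the sample UUID-style generation string are hypothetical; it assumes the generation is available as a string and the identifier is `None` when the sstable's metadata does not carry one.

```python
from typing import Optional


def dedup_strategy(generation: str, identifier: Optional[str]) -> str:
    """Pick a deduplication strategy label for one sstable (hypothetical)."""
    if identifier is not None:
        # Newest case: the sstable carries a unique identifier.
        return "sstable-identifier"
    if generation.isdigit():
        # Oldest case: numeric generation ID.
        return "integer-generation"
    # Middle case: UUID generation, but no sstable identifier.
    return "uuid-generation"


strategies = [
    dedup_strategy("12", None),
    dedup_strategy("3gbc_17lw_3tlpc2fe8ko69yav6u", None),
    dedup_strategy("3gbc_17lw_3tlpc2fe8ko69yav6u", "id-0001"),
]
```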

karol-kokoszka commented 1 month ago

Tablets migrate between nodes, and during migration the SSTable name can change. This makes the deduplication process introduced for VNodes clusters ineffective when the cluster uses tablets, because the current deduplication process relies on the SSTable bundle name plus the node it belongs to.

When a tablet migrates, a new SSTable bundle name is generated, and the node it belongs to may change. Even though the content of the SSTable stays the same and this data is already backed up, Scylla Manager is not aware of it, so it will eventually copy the new SSTable to the backup destination. The originally backed-up SSTable (the one that migrated to another name/node) will be removed by the purge stage once SM realises that no "live" snapshot references it.
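
The re-upload and purge behaviour described above can be sketched as follows. This is a simplified model, not SM's actual purge code: the file paths, the representation of a snapshot as a set of referenced paths, and `purge_candidates` are all hypothetical.

```python
def purge_candidates(stored_files: set[str], live_snapshots: list[set[str]]) -> set[str]:
    """Return stored files that no live snapshot references."""
    referenced = set().union(*live_snapshots)
    return stored_files - referenced


# Snapshot 1 was taken before the migration, snapshot 2 after it.
snapshot_1 = {"node-a/me-12-big-Data.db"}
snapshot_2 = {"node-b/me-7-big-Data.db"}  # same content, new name and node

# Both copies sit in the backup location after the second backup.
stored = snapshot_1 | snapshot_2

# While both snapshots are live, nothing can be purged.
purged_while_both_live = purge_candidates(stored, [snapshot_1, snapshot_2])

# Once snapshot 1 expires, the original copy loses its last reference.
purged_after_expiry = purge_candidates(stored, [snapshot_2])
```

In this model the same data is uploaded twice and stored twice until the older snapshot expires.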

Let me summarize the problems identified so far.

An SSTable is identified by, e.g., the node ID (among others). Tablets make this identification invalid.
SSTables change their name when migrating; the ID is saved to metadata. How do we get this information (@regevran)? Is this metadata added in some new format/version of the SSTable, or does it extend an existing one? I'm referring to this list of SSTable format versions: https://opensource.docs.scylladb.com/stable/architecture/sstable/

The fact is that deduplication in SM is not going to work efficiently for Scylla Enterprise 2024.2 when tablets are enabled. The backup storage is going to keep more of the cluster's data than is needed.

Encryption at Rest will bother us only if tablets can migrate between datacenters, but I understand this is not the case.

regevran commented 1 month ago

> How to get this information

I think with dump-scylla-metadata. If the sstable identifier is indeed dumped by this command, we had better document that it is.

bhalevy commented 1 month ago

> > How to get this information

> I think that with dump-scylla-metadata. if sstable-identifier is indeed dumped with this command - we better document it is.

See scylladb/scylladb#21221

scylladb / scylla-manager

Use sstable identifier for deduplication instead of sstable generation ID #4069