scylladb / scylla-migrator

Migrate data using Spark to Scylla, normally from Cassandra or Parquet files. Alternatively, from DynamoDB to Scylla Alternator.
https://migrator.docs.scylladb.com/stable/
Apache License 2.0

Enable exclude TTL expired on migrate #179

Closed (pdbossman closed this issue 2 months ago)

pdbossman commented 4 months ago

With DynamoDB, the user can store an expiration date-time in a column, and TTL can be enabled or disabled on that column. When TTL is enabled, DynamoDB periodically scans the table and deletes the expired items.

This means the migration may stream a significant amount of data that has already expired. That alone can slow the migration down. It can also leave a large backlog of expired items to be scanned and deleted after the migration.

Due to all of the above, it'd be desirable to have an option to discard items being copied from source to target that have already expired.

julienrf commented 2 months ago

This feature could be implemented by calling DescribeTimeToLive to retrieve the name of the column that contains the expiration timestamp, and then filtering out items that are expired. I wonder if we should make this behavior optional at all. Are there cases where we would like to preserve expired items?
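
For illustration, here is a minimal Python sketch of that idea (the migrator itself runs on Spark, so this is not its actual code). It assumes the DescribeTimeToLive response has already been fetched and that items are in DynamoDB's low-level JSON format, where the TTL value is a Number holding epoch seconds. The helper names `ttl_attribute_name` and `is_expired` are hypothetical.

```python
from typing import Any, Mapping, Optional


def ttl_attribute_name(describe_response: Mapping[str, Any]) -> Optional[str]:
    """Return the TTL attribute name from a DescribeTimeToLive response,
    or None when TTL is not enabled on the table."""
    description = describe_response.get("TimeToLiveDescription", {})
    if description.get("TimeToLiveStatus") == "ENABLED":
        return description.get("AttributeName")
    return None


def is_expired(item: Mapping[str, Any], ttl_attribute: str, now_epoch_seconds: float) -> bool:
    """Return True when the item's TTL value is at or before `now_epoch_seconds`.

    Items without the attribute, or with a non-Number value, are treated
    as not expired (an assumption of this sketch).
    """
    value = item.get(ttl_attribute)
    if not value or "N" not in value:
        return False
    try:
        return float(value["N"]) <= now_epoch_seconds
    except ValueError:
        return False
```

With these helpers, the items to copy would be `[i for i in items if not is_expired(i, name, time.time())]`, where `name` comes from `ttl_attribute_name`.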

pdbossman commented 2 months ago

Validation may temporarily fail as DynamoDB would still return the records while we don't. But practically speaking, I'd always want to discard expired items...

julienrf commented 2 months ago

I don’t think validation would return the records since we will also exclude them when reading the source table.

The way this is implemented in PR #206 is by configuring the Scan operation on the source table to filter out the expired items. This means these items are not even loaded from the source database. This is the case both when we perform the migration and when we perform the validation.

Is this fine?
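
As a rough sketch of what such a Scan configuration could look like (written in Python for brevity, and not taken from PR #206), the function below builds the request parameters. The function name `scan_kwargs_excluding_expired` and the placeholders `#ttl` and `:now` are hypothetical. The TTL attribute name is assumed to come from DescribeTimeToLive.

```python
from typing import Any, Dict


def scan_kwargs_excluding_expired(
    table_name: str, ttl_attribute: str, now_epoch_seconds: float
) -> Dict[str, Any]:
    """Build Scan parameters that keep only items with no TTL value
    or with a TTL value strictly in the future."""
    return {
        "TableName": table_name,
        # Keep items that have no TTL attribute, or whose TTL is after "now".
        "FilterExpression": "attribute_not_exists(#ttl) OR #ttl > :now",
        # A placeholder avoids clashes with DynamoDB reserved words.
        "ExpressionAttributeNames": {"#ttl": ttl_attribute},
        # Low-level API format: Number values are passed as strings.
        "ExpressionAttributeValues": {":now": {"N": str(int(now_epoch_seconds))}},
    }
```

These parameters could then be passed to a low-level client, for example `boto3.client("dynamodb").scan(**kwargs)`. Note that DynamoDB applies a filter expression after the items are read, so expired items still consume read capacity even though they are not returned to the caller.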