scylladb / scylla-migrator

Migrate data to Scylla using Spark, normally from Cassandra or Parquet files; alternatively, from DynamoDB to Scylla Alternator.
https://migrator.docs.scylladb.com/stable/
Apache License 2.0

Unify Dynamo sources and targets; CDC for Dynamo tables #23

Closed: iravid closed this 4 years ago

iravid commented 4 years ago

Summary of changes

  1. Spark was upgraded to 2.4.4, so this resolves #21.
  2. The previous DynamoDB migrator, which used a separate entrypoint, has been unified into the Migrator entrypoint. Dynamo migrations are now configured by providing source and target configurations that use type: dynamodb; a configuration sketch follows this list.
  3. For that, we now depend on a forked version of https://github.com/audienceproject/spark-dynamodb. See the Forked Libraries section for details on why the library was forked.
  4. Dynamo targets now support a streamChanges parameter when the source is a DynamoDB table (also shown in the sketch below). This causes a DynamoDB Stream to be enabled on the source table before the snapshot migration starts; after the migration completes, the changes captured in the stream are consumed and applied to the target table.
  5. To support that, we now depend on https://github.com/iravid/spark-kinesis. See Forked Libraries for details on why the library was forked.
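To make items 2 and 4 concrete, here is a minimal sketch of what a DynamoDB-to-Alternator configuration might look like. The key names are illustrative and may not match the shipped configuration format exactly:

```yaml
# Hypothetical sketch only: key names are illustrative and may differ
# from the actual config schema.
source:
  type: dynamodb
  table: source-table
  region: us-east-1
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>

target:
  type: dynamodb
  table: target-table
  # Point the target at a local Alternator instance (assumed endpoint).
  endpoint:
    host: http://localhost
    port: 8000
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
  # Enable a DynamoDB Stream on the source table before the snapshot
  # migration and replay captured changes afterwards (item 4 above).
  streamChanges: true
```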

How this was tested

Using https://github.com/iravid/migrator-dynamo-mutator, a DynamoDB table is created and random mutations are applied to it. Concurrently, the migrator is started with that table as the source and a local Alternator instance as the target. The mutator keeps applying random mutations until the user signals it to stop, and then compares the source and target tables. Several runs of this experiment showed zero differences in both directions; a rough sketch of that symmetric comparison follows.
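For illustration only, here is what such a symmetric table comparison could look like with the AWS Java SDK from Scala. The object and method names (TableDiff, scanAll) and the table names are hypothetical, not code from the mutator tool, and a full scan into memory is only reasonable for small test tables:

```scala
import com.amazonaws.services.dynamodbv2.{AmazonDynamoDB, AmazonDynamoDBClientBuilder}
import com.amazonaws.services.dynamodbv2.model.{AttributeValue, ScanRequest}
import scala.collection.JavaConverters._

object TableDiff {
  // Scan an entire table into memory, following pagination via
  // LastEvaluatedKey. Fine for small test tables only.
  def scanAll(client: AmazonDynamoDB, table: String): Set[Map[String, AttributeValue]] = {
    var items = Set.empty[Map[String, AttributeValue]]
    var request = new ScanRequest(table)
    var done = false
    while (!done) {
      val page = client.scan(request)
      items ++= page.getItems.asScala.map(_.asScala.toMap)
      val lastKey = page.getLastEvaluatedKey
      if (lastKey == null || lastKey.isEmpty) done = true
      else request = request.withExclusiveStartKey(lastKey)
    }
    items
  }

  def main(args: Array[String]): Unit = {
    val client = AmazonDynamoDBClientBuilder.defaultClient()
    val source = scanAll(client, "source-table")
    val target = scanAll(client, "target-table")
    // The "symmetric" check: items missing from either side count as differences.
    println(s"missing in target: ${(source -- target).size}")
    println(s"missing in source: ${(target -- source).size}")
  }
}
```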

Limitations

Forked Libraries

  1. spark-dynamodb was forked in order to support:
     a. static credentials;
     b. custom endpoints;
     c. nulls on the table size and item counts returned from DescribeTable (a current Alternator implementation detail);
     d. writing mixed batches of puts and deletes, required for applying changes from the DynamoDB Stream to the target table (see the sketch at the end of this section);
     e. making the conversion utilities from DynamoDB documents to Spark rows public.
  2. spark-kinesis was forked in order to add support for the DynamoDB Streams client. The upstream project is maintained inside the Apache Spark repository itself.

Of these two, I intend to work on upstreaming the spark-dynamodb changes going forward.
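To make item 1.d concrete, here is a hypothetical sketch (not code from the fork) of how stream records can be turned into a single mixed batch of puts and deletes with the AWS Java SDK; the object and method names are invented for illustration:

```scala
import com.amazonaws.services.dynamodbv2.AmazonDynamoDB
import com.amazonaws.services.dynamodbv2.model._
import scala.collection.JavaConverters._

object StreamApply {
  // Applies one batch of DynamoDB Stream records to the target table:
  // INSERT and MODIFY events become puts, REMOVE events become deletes.
  def applyStreamBatch(client: AmazonDynamoDB, table: String, records: Seq[Record]): Unit = {
    val writes = records.map { record =>
      record.getEventName match {
        case "INSERT" | "MODIFY" =>
          new WriteRequest().withPutRequest(
            new PutRequest().withItem(record.getDynamodb.getNewImage))
        case "REMOVE" =>
          new WriteRequest().withDeleteRequest(
            new DeleteRequest().withKey(record.getDynamodb.getKeys))
      }
    }
    // A real implementation must chunk to BatchWriteItem's 25-item limit
    // and retry any unprocessed items the call returns.
    client.batchWriteItem(
      new BatchWriteItemRequest().withRequestItems(
        Map(table -> writes.asJava).asJava))
  }
}
```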

iravid commented 4 years ago

@tarzanek Would you like to test-drive this?

iravid commented 4 years ago

Hmm, the updated version of the Cassandra connector broke compilation. I missed this locally because I had a jar of the previous version.

Fix incoming.

tarzanek commented 4 years ago

Let me try it; it looks good from the quick look I had.

iravid commented 4 years ago

Awesome, thanks!

iravid commented 4 years ago

@tarzanek Going to go ahead and merge this. Still interested in your experience if you get the chance!