scylladb / scylla-migrator

Migrate data extracts using Spark to Scylla, normally from Cassandra
Apache License 2.0

Do not try to infer a schema when migrating from DynamoDB to Alternator #105

Closed julienrf closed 4 months ago

julienrf commented 4 months ago

Fixes #103.

Instead of using com.audienceproject:spark-dynamodb to migrate the data, we use com.amazon.emr:emr-dynamodb-hadoop as described in https://aws.amazon.com/fr/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/.

The former approach loaded the data as a DataFrame, which required us to infer the data schema, but that could not be done reliably.

The new approach still benefits from the Spark infrastructure to handle the data transfer efficiently, but the data is loaded as an RDD.
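As a rough illustration of the new approach, the AWS blog post linked above shows how the EMR DynamoDB connector exposes table items to Spark through a Hadoop input format, yielding an `RDD[(Text, DynamoDBItemWritable)]` of raw attribute maps with no schema inference. The sketch below follows that pattern; the table name, region, and app name are placeholder values, and it is not the migrator's actual code:

```scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.{SparkConf, SparkContext}

object DynamoDbRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dynamodb-rdd"))

    // Hadoop configuration for com.amazon.emr:emr-dynamodb-hadoop;
    // "SourceTable" and "us-east-1" are placeholders.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.input.tableName", "SourceTable")
    jobConf.set("dynamodb.region", "us-east-1")
    jobConf.set("mapred.input.format.class",
      "org.apache.hadoop.dynamodb.read.DynamoDBInputFormat")

    // Items arrive as raw DynamoDB attribute maps, so there is
    // no DataFrame schema to infer.
    val items = sc.hadoopRDD(
      jobConf,
      classOf[DynamoDBInputFormat],
      classOf[Text],
      classOf[DynamoDBItemWritable]
    )

    println(s"item count: ${items.count()}")
  }
}
```

Because each `DynamoDBItemWritable` wraps the item's attribute values directly, the migrator can forward them to Alternator without first committing to a column schema, while Spark still handles partitioning and parallelism of the transfer.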

To achieve this, I had to de-unify the migrators for Scylla and Alternator (so, in a sense, I am undoing what was done in #23). The benefit is that the Alternator migrator is no longer constrained by the requirements of the Scylla migrator.

tarzanek commented 4 months ago

@julienrf please continue here; the goal is to get rid of the audienceproject dependency (fork) and ideally also get rid of our Kinesis fork.

tarzanek commented 4 months ago

merging, thank you @julienrf !