Fixes #103.
Instead of using `com.audienceproject:spark-dynamodb` to migrate the data, we use `com.amazon.emr:emr-dynamodb-hadoop`, as described in https://aws.amazon.com/fr/blogs/big-data/analyze-your-data-on-amazon-dynamodb-with-apache-spark/.

The former approach loaded the data as a `DataFrame`, which required us to infer the data schema, but that was not doable in a reliable way. The new approach still benefits from the Spark infrastructure to handle the data transfer efficiently, but the data is loaded as an `RDD`.
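For reference, the linked blog post reads a DynamoDB table through the Hadoop input format roughly as follows. This is a minimal sketch, assuming a Spark application with the `emr-dynamodb-hadoop` artifact on the classpath; the object name, table name, region, and endpoint below are placeholders, not values from this PR:

```scala
import org.apache.hadoop.dynamodb.DynamoDBItemWritable
import org.apache.hadoop.dynamodb.read.DynamoDBInputFormat
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.rdd.RDD
import org.apache.spark.{SparkConf, SparkContext}

object DynamoDbReadSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("dynamodb-rdd-read"))

    // Tell the Hadoop input format which table to scan and where it lives.
    // The table name, region, and endpoint are placeholder values.
    val jobConf = new JobConf(sc.hadoopConfiguration)
    jobConf.set("dynamodb.input.tableName", "my-source-table")
    jobConf.set("dynamodb.regionid", "us-east-1")
    jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")

    // Each element is a raw DynamoDB item; no schema inference is involved,
    // unlike the former DataFrame-based approach.
    val items: RDD[(Text, DynamoDBItemWritable)] =
      sc.hadoopRDD(
        jobConf,
        classOf[DynamoDBInputFormat],
        classOf[Text],
        classOf[DynamoDBItemWritable]
      )

    println(s"Read ${items.count()} items")
    sc.stop()
  }
}
```

The same artifact also provides `org.apache.hadoop.dynamodb.write.DynamoDBOutputFormat` (configured via `dynamodb.output.tableName`) for the write side of the transfer.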
To achieve this, I had to de-unify the migrators for Scylla and Alternator (so, in a sense, I am undoing what was done in #23). The benefit is that the Alternator migrator is no longer constrained by the requirements of the Scylla migrator.