scylladb / scylla-migrator

Migrate data extract using Spark to Scylla, normally from Cassandra
Apache License 2.0

RFC: import info from Amazon DDB export to S3 #136

Closed tzach closed 1 month ago

tzach commented 2 months ago

It would be useful to import Amazon DDB S3 exports into ScyllaDB using the migrator. The JSON format of the export is detailed here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html

julienrf commented 2 months ago

Here is a plan to support this feature:

guy9 commented 2 months ago

Additional discussion that I'm moving here:

@julienrf : The reporter would like the Migrator to be able to import DynamoDB data that have been exported to S3 using a standard export feature in AWS. Exports can use two formats: JSON Lines or Amazon Ion. Also, there are two possible types of exports: full or incremental. We would have to use an S3 client to access the export. Parsing the JSON Lines format should be pretty straightforward, but parsing the Amazon Ion format will require additional work. Supporting incremental exports would require specific logic too. This feature will not be easy to CI-test because it relies on AWS, although it might be possible to emulate the AWS environment in Docker containers using LocalStack or MinIO. I think it should be possible to implement, document, and test the feature in one day (full exports in JSON Lines only). Supporting the Amazon Ion format may take a couple more hours. Supporting incremental exports would take a few more hours. Setting up CI tests would also take several extra hours. So, we should count 2 days of work for the complete support and continuous integration tests.
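To give an idea of what "parsing the JSON Lines format" could involve, here is a minimal sketch in Python (not the migrator's Scala code, and not a proposed implementation). It assumes a full export in DynamoDB JSON, where each line is an object of the form `{"Item": {...}}` and every attribute value is wrapped in a type tag such as `S`, `N`, or `M`, as described in the AWS page linked above. The function names `decode_attribute` and `read_export_lines` are hypothetical, and fetching the files from S3 is left out entirely.

```python
import base64
import json
from decimal import Decimal


def decode_attribute(value):
    """Convert one type-tagged attribute value, e.g. {"N": "42"}, to a Python value."""
    # Each attribute value is a single-entry object: {type_tag: payload}.
    (type_tag, payload), = value.items()
    if type_tag == "S":
        return payload
    if type_tag == "N":
        # Numbers are serialized as strings; Decimal avoids losing precision.
        return Decimal(payload)
    if type_tag == "B":
        return base64.b64decode(payload)
    if type_tag == "BOOL":
        return payload
    if type_tag == "NULL":
        return None
    if type_tag == "L":
        return [decode_attribute(v) for v in payload]
    if type_tag == "M":
        return {k: decode_attribute(v) for k, v in payload.items()}
    if type_tag == "SS":
        return set(payload)
    if type_tag == "NS":
        return {Decimal(n) for n in payload}
    if type_tag == "BS":
        return {base64.b64decode(b) for b in payload}
    raise ValueError(f"unknown DynamoDB type tag: {type_tag}")


def read_export_lines(stream):
    """Yield one decoded item per non-empty line of a JSON Lines text stream."""
    for line in stream:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        # Assumption: full exports wrap each item in an "Item" field.
        yield {name: decode_attribute(v) for name, v in record["Item"].items()}
```

Since the data files of an export are gzip-compressed, a caller would typically wrap each downloaded file with `gzip.open(path, "rt", encoding="utf-8")` before passing it to `read_export_lines`. Incremental exports use a different per-line layout and would need their own handling, which this sketch does not attempt.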

@guy9 : Understanding that both formats can be exported (Ion and JSON), is there a reason to support the Ion format and not just the JSON format?

@julienrf : I don’t know if Ion is used a lot or not. But in case it is popular, it would make sense to support it. Compared to JSON, it seems to provide slightly richer types of values (e.g. they have a proper date-time type whereas in JSON we would use a string) and a binary encoding format (which I assume is denser than the usual text encoding). They maintain a Java library to work with Ion, so we would not have to re-implement everything.