RFC: import info from Amazon DDB export to S3

tzach commented 2 months ago

It would be useful to import Amazon DDB S3 exports into ScyllaDB using the migrator. The JSON format of the export is detailed here: https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/S3DataExport.Output.html

julienrf commented 2 months ago

Here is a plan to support this feature:

1. Introduce a new type of source, DynamoDBS3Export, containing the information needed to access the bucket (target bucket, export prefix, AWS credentials).
2. Use an S3 client to access the bucket and read the details about the export in the manifest-summary.json (export format and export type).
3. Use the S3 client to read the actual data.
4. Implement the necessary logic to handle both file formats (JSON Lines and Amazon Ion).
5. Implement the necessary logic to handle both export types (full and incremental).
6. Test the feature on real data in AWS.
7. Write documentation.
8. Implement integration tests to run in the CI (might be hard to achieve because we would have to replicate the AWS infrastructure locally).
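
To make the JSON Lines part of the plan concrete, here is a minimal sketch in Python (not the migrator's actual code) of decoding one data file of a full export. It assumes the layout described in the AWS documentation linked above: each data file is gzip-compressed, and each line is an object with an "Item" key whose attributes use DynamoDB type tags ("S", "N", "L", ...). The function names decode_attribute and read_full_export are hypothetical, and downloading the file from S3 is left out: the bytes are passed in directly.

```python
import base64
import gzip
import io
import json
from decimal import Decimal


def decode_attribute(value):
    """Convert one DynamoDB-JSON attribute value, e.g. {"S": "a"}, to a Python value."""
    # Each attribute value is an object with exactly one key: its type tag.
    (type_tag, payload), = value.items()
    if type_tag == "S":
        return payload
    if type_tag == "N":
        return Decimal(payload)  # numbers are serialized as strings
    if type_tag == "BOOL":
        return payload
    if type_tag == "NULL":
        return None
    if type_tag == "B":
        return base64.b64decode(payload)  # binary is base64-encoded
    if type_tag == "L":
        return [decode_attribute(v) for v in payload]
    if type_tag == "M":
        return {k: decode_attribute(v) for k, v in payload.items()}
    if type_tag == "SS":
        return set(payload)
    if type_tag == "NS":
        return {Decimal(n) for n in payload}
    if type_tag == "BS":
        return {base64.b64decode(b) for b in payload}
    raise ValueError(f"unsupported attribute type: {type_tag}")


def read_full_export(gzipped_bytes):
    """Yield decoded items from one gzip-compressed JSON Lines data file."""
    with gzip.open(io.BytesIO(gzipped_bytes), mode="rt", encoding="utf-8") as lines:
        for line in lines:
            if line.strip():
                item = json.loads(line)["Item"]  # assumed wrapper key for full exports
                yield {name: decode_attribute(v) for name, v in item.items()}


# Usage on an in-memory sample instead of a real S3 object.
sample = b'{"Item":{"id":{"S":"a1"},"count":{"N":"3"},"tags":{"L":[{"S":"x"},{"BOOL":true}]}}}\n'
items = list(read_full_export(gzip.compress(sample)))
```

In the migrator itself, the decoded items would presumably be mapped to the same representation used by the existing DynamoDB source rather than to plain Python values.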

guy9 commented 2 months ago

Additional discussion that I'm moving here:

@julienrf : The reporter would like the Migrator to be able to import DynamoDB data that have been exported to S3 using a standard export feature in AWS. Exports can use two formats: JSON Lines or Amazon Ion. Also, there are two possible types of exports: full or incremental. We would have to use an S3 client to access the export. Parsing the JSON Lines format should be pretty straightforward, but parsing the Amazon Ion format will require additional work. Supporting incremental exports would require specific logic too. This feature will not be easy to CI-test because it relies on AWS, although it might be possible to emulate the AWS environment in Docker containers using LocalStack or MinIO. I think it should be possible to implement, document, and test the feature in one day (full exports in JSON Lines only). Supporting the Amazon Ion format may take a couple more hours. Supporting incremental exports would take a few more hours. Setting up CI tests would also take several extra hours. So, we should count 2 days of work for the complete support and continuous integration tests.
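
As an illustration of the "specific logic" that incremental exports would need, here is a minimal Python sketch (not the migrator's code). It assumes the record layout described in the AWS export documentation, where each line carries "Keys", an optional "NewImage", and a "Metadata" object with a "WriteTimestampMicros" number; a record without "NewImage" is treated as a deletion. The function name classify_change is hypothetical, and the attribute values are left in DynamoDB JSON form.

```python
import json


def classify_change(line):
    """Turn one incremental-export line into a put or delete operation."""
    record = json.loads(line)
    # Assumed field names: "Metadata", "WriteTimestampMicros", "Keys", "NewImage".
    micros = record.get("Metadata", {}).get("WriteTimestampMicros", {}).get("N")
    change = {
        "keys": record["Keys"],
        "timestamp_micros": int(micros) if micros is not None else None,
    }
    new_image = record.get("NewImage")
    if new_image is None:
        # No new image: the item was deleted during the export window.
        change["op"] = "delete"
        change["item"] = None
    else:
        # A new image is present: the item was created or updated.
        change["op"] = "put"
        change["item"] = new_image
    return change


# Usage on two in-memory sample lines.
put_line = ('{"Metadata":{"WriteTimestampMicros":{"N":"1680000000000000"}},'
            '"Keys":{"id":{"S":"a1"}},"NewImage":{"id":{"S":"a1"},"v":{"N":"2"}}}')
delete_line = ('{"Metadata":{"WriteTimestampMicros":{"N":"1680000000000001"}},'
               '"Keys":{"id":{"S":"a2"}},"OldImage":{"id":{"S":"a2"}}}')
put_change = classify_change(put_line)
delete_change = classify_change(delete_line)
```

Keeping the write timestamp alongside each operation would let the target apply changes in a deterministic order, whatever order the export files are read in.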

@guy9 : Given that exports can use either format (Ion or JSON), is there a reason to support the Ion format and not just the JSON format?

@julienrf : I don’t know if Ion is used a lot or not. But in case it is popular, it would make sense to support it. Compared to JSON, it seems to provide slightly richer types of values (e.g. they have a proper date-time type whereas in JSON we would use a string) and a binary encoding format (which I assume is denser than the usual text encoding). They maintain a Java library to work with Ion, so we would not have to re-implement everything.

scylladb / scylla-migrator

RFC: import info from Amazon DDB export to S3 #136