scylladb / scylla-migrator

Migrate data extract using Spark to Scylla, normally from Cassandra
Apache License 2.0

The sandboxed testing environment cannot use AWS #113

Open julienrf opened 4 months ago

julienrf commented 4 months ago

In #107 we introduced a testing infrastructure that allows us to test several migration scenarios. Unfortunately, the streamChanges feature uses the spark-kinesis module under the hood, and that module makes calls to the real AWS servers instead of using the containerized service.

Possible solutions would be either to fix the code of spark-kinesis so that it stays within the sandbox environment (this is a known issue, see https://github.com/localstack/localstack/issues/677 and https://issues.apache.org/jira/browse/SPARK-27950), or to use something other than spark-kinesis.
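One way to keep spark-kinesis inside the sandbox would be to make the Kinesis endpoint configurable instead of hard-coded. A minimal sketch of the idea, not the module's actual code: the `KinesisEndpoint` object and `resolve` method are hypothetical names, and LocalStack's default edge port 4566 is an assumption about the sandbox setup.

```scala
// Hypothetical sketch: resolve the Kinesis endpoint from an optional
// sandbox override instead of always targeting the real AWS servers.
object KinesisEndpoint {
  // When a sandbox endpoint (e.g. LocalStack at http://localhost:4566)
  // is configured, use it; otherwise fall back to the regional AWS URL.
  def resolve(region: String, sandboxEndpoint: Option[String]): String =
    sandboxEndpoint.getOrElse(s"https://kinesis.$region.amazonaws.com")
}
```

With a hook like this, the test infrastructure could point the streamChanges scenarios at the containerized Kinesis while production runs keep the default AWS endpoint.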


julienrf commented 4 days ago

Commenting on this issue instead of creating a new one because this is related to the testing infrastructure.

Currently, our testing infrastructure recreates the AWS stack (S3 and DynamoDB) in Docker containers. This works okay but comes with limitations:

- the spark-kinesis module has a hard-coded dependency on the real AWS endpoints, so the streamChanges scenarios cannot run against the containerized services;
- a containerized implementation of AWS cannot tell us how the migrator behaves against the real service (for instance, for benchmarks).

While the first point could be (and should be, ideally) fixed by removing the hard-coded dependency on AWS, to address the second point we have no choice but to have tests that use the real AWS. And, in practice, fixing the first point would require changing our copy of the spark-kinesis project, which is undesirable: it is better to keep our copy as close as possible to the original so that we can merge upstream improvements into it.

I believe those points motivate the need for tests that use the real AWS instead of a containerized implementation of it. Except for the benchmarks, such tests should not be expensive, because they would not consume much bandwidth.
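If tests against the real AWS are added, they could be gated so that they only run when credentials are available, keeping the sandboxed suite self-contained for contributors without AWS access. A minimal sketch of that idea; the `RealAwsTests` object and the environment-variable convention are assumptions, not part of the project:

```scala
// Hypothetical sketch: only run real-AWS tests when credentials are
// present in the environment, so CI jobs without AWS access skip them.
object RealAwsTests {
  // AWS SDKs conventionally read these two variables for credentials.
  def enabled(env: Map[String, String]): Boolean =
    env.contains("AWS_ACCESS_KEY_ID") && env.contains("AWS_SECRET_ACCESS_KEY")
}
```

A test suite could call `RealAwsTests.enabled(sys.env)` and mark the real-AWS scenarios as ignored when it returns false.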

I propose the following course of action: