scylladb / scylla-migrator

Migrate data to Scylla using Spark, normally from Cassandra
Apache License 2.0

Add support for temporary AWS credentials via AssumeRole #150

Closed: julienrf closed this 3 weeks ago

julienrf commented 3 weeks ago

Relates to #149

julienrf commented 3 weeks ago

Adding tests for that would require evolving the testing infrastructure to mock AWS, because we need to call the Security Token Service. I tested it locally with DynamoDB as a source database.

Since the logic is the same when we connect to a DynamoDB source, a DynamoDB target, or a DynamoDB S3 export used as a source, I expect all of them to work as well. However, the logic is slightly different when using a Parquet source, so that one requires additional tests.
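
For reference, here is a rough sketch (not taken from this PR) of how temporary credentials are obtained through AssumeRole with the AWS SDK for Java v1; the role ARN and session name below are placeholders:

```scala
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder

// Credentials provider that calls STS AssumeRole and transparently refreshes
// the temporary credentials before they expire (needs the aws-java-sdk-sts artifact).
val credentialsProvider =
  new STSAssumeRoleSessionCredentialsProvider.Builder(
    "arn:aws:iam::123456789012:role/migrator", // placeholder role ARN
    "scylla-migrator"                          // placeholder session name
  ).withStsClient(AWSSecurityTokenServiceClientBuilder.defaultClient())
   .build()

// The provider can then be used wherever plain static credentials were used
// before, e.g. to build the DynamoDB client.
val dynamoDb =
  AmazonDynamoDBClientBuilder.standard()
    .withCredentials(credentialsProvider)
    .build()
```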

julienrf commented 3 weeks ago

I ran a migration scenario from Parquet to Scylla to validate that this PR also works when reading from Parquet files stored on S3, but it failed.

First, note that I had to add hadoop-aws-2.6.5.jar to the Spark cluster classpath to make it work; otherwise I was getting a ClassNotFoundException when trying to read from s3a://… URIs. (But this is unrelated to the PR.)

Then, it seems the authorization delegation does not work. I was able to load the Parquet file when using my own account credentials, but it didn’t work when using the assumeRole option. It failed with a Forbidden error when trying to read the S3 object containing the Parquet file.

I tried on the command line (aws s3 ls …) and was able to access the file when using “assume role” as described here, which means the error is not on the AWS side but on the scylla-migrator side.

I tried a couple of variations to read the Parquet files (e.g. using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider or org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider), but nothing worked. The culprit line is this one:

https://github.com/apache/spark/blob/7955b3962ac46b89564e0613db7bea98a1478bf2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L547

I checked with my debugger: the hadoopConf does contain all the authentication information, and yet the returned FileSystem does not use the correct credentials and cannot access the resource. It seems we use a very old version of Spark and Hadoop (2.6.5), which did not yet support plugging in a custom credentials provider. According to this commit, that ability was introduced later.
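
To make the variations concrete, this is roughly what they look like when set on the Hadoop configuration of the Spark session. The property names are the standard S3A ones from newer hadoop-aws releases (if I recall correctly, 2.8+ for the temporary-credentials provider and 3.1+ for the assumed-role provider), which is precisely why hadoop-aws 2.6.5 ignores them; bucket, keys, and ARN are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Variation 1: pass the temporary STS credentials directly
// (the S3A connector from hadoop-aws 2.8+ understands fs.s3a.session.token).
hadoopConf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", "<temporary access key>")
hadoopConf.set("fs.s3a.secret.key", "<temporary secret key>")
hadoopConf.set("fs.s3a.session.token", "<session token>")

// Variation 2: let the S3A connector perform the AssumeRole call itself
// (hadoop-aws 3.1+ only).
// hadoopConf.set("fs.s3a.aws.credentials.provider",
//   "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
// hadoopConf.set("fs.s3a.assumed.role.arn", "<role ARN>")

val df = spark.read.parquet("s3a://<bucket>/<path to parquet files>")
```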

The simplest path forward would be to bump the Hadoop version without changing the Spark version (changing the Spark version would also force us to change the Scala version, which may require even more work). Otherwise, we should probably think about updating to a more recent version of Spark.
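
For the record, a sketch of what pinning a newer Hadoop client in build.sbt could look like; the versions below are placeholders, and whether the resulting classpath is actually compatible with our Spark version would still need to be verified:

```scala
// build.sbt (sketch): keep the current Spark dependency but force a newer
// Hadoop client onto the classpath. Versions are placeholders, not tested.
dependencyOverrides ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "3.2.1",
  "org.apache.hadoop" % "hadoop-common" % "3.2.1"
)

// hadoop-aws must match the Hadoop client version.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.2.1"
```

Note that this only changes what the migrator is built against; the Hadoop jars shipped with the Spark cluster itself would have to be aligned as well.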

julienrf commented 3 weeks ago

I removed the changes related to Parquet since they require Hadoop 3.x. I think we can merge the PR as it improves the way we authenticate to AWS when migrating from DynamoDB, and we can re-apply the changes related to Parquet after we upgrade to Hadoop 3.x.

tarzanek commented 3 weeks ago

thank you Julien, merging