scylladb / scylla-migrator

Migrate data to Scylla using Spark, normally from Cassandra
Apache License 2.0

Add support for temporary AWS credentials via AssumeRole #150

Closed: julienrf closed this 3 weeks ago

julienrf commented 3 weeks ago

Relates to #149

julienrf commented 3 weeks ago

Adding tests for that would require evolving the testing infrastructure to mock AWS, because we need to call the Security Token Service. I tested it locally with DynamoDB as a source database.

Since the logic is the same when we connect to a DynamoDB source, a DynamoDB target, or a DynamoDB S3 export used as a source, I expect all of them to work as well. However, the logic is slightly different when using a Parquet source, so that one requires additional tests.
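
For reference, here is a rough sketch (not taken from this PR) of how temporary credentials are obtained through AssumeRole with the AWS SDK for Java v1; the role ARN and session name below are placeholders:

```scala
import com.amazonaws.auth.STSAssumeRoleSessionCredentialsProvider
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.dynamodbv2.AmazonDynamoDBClientBuilder

// Credentials provider that calls STS AssumeRole and transparently refreshes
// the temporary credentials before they expire (needs the aws-java-sdk-sts artifact).
val credentialsProvider =
  new STSAssumeRoleSessionCredentialsProvider.Builder(
    "arn:aws:iam::123456789012:role/migrator", // placeholder role ARN
    "scylla-migrator"                          // placeholder session name
  ).withStsClient(AWSSecurityTokenServiceClientBuilder.defaultClient())
   .build()

// The provider can then be used wherever plain static credentials were used
// before, e.g. to build the DynamoDB client.
val dynamoDb =
  AmazonDynamoDBClientBuilder.standard()
    .withCredentials(credentialsProvider)
    .build()
```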

julienrf commented 3 weeks ago

I ran a migration scenario from Parquet to Scylla to validate that this PR also works when reading from Parquet files stored on S3, but it failed.

First, note that I had to add hadoop-aws-2.6.5.jar to the Spark cluster classpath to make it work; otherwise I was getting a ClassNotFoundException when trying to read from s3a://… URIs. (But this is unrelated to the PR.)

Then, it seems the authorization delegation does not work. I was able to load the Parquet file when using my own account credentials, but it didn’t work when using the assumeRole option. It failed with a Forbidden error when trying to read the S3 object containing the Parquet file.

I tried on the command line (aws s3 ls …) and was able to access the file when using “assume role” as described here, which means the error is not on the AWS side but on the scylla-migrator side.

I tried a couple of variations to read the Parquet files (e.g. using org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider or org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider), but nothing worked. The culprit line is this one:

https://github.com/apache/spark/blob/7955b3962ac46b89564e0613db7bea98a1478bf2/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/DataSource.scala#L547

I checked with my debugger: the hadoopConf does contain all the authentication information, and yet the returned FileSystem does not use the correct credentials and cannot access the resource. It seems we use a very old version of Spark and Hadoop (2.6.5), which did not yet support plugging in a custom credentials provider. According to this commit, that ability was introduced later.
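
To make the variations concrete, this is roughly what they look like when set on the Hadoop configuration of the Spark session. The property names are the standard S3A ones from newer hadoop-aws releases (if I recall correctly, 2.8+ for the temporary-credentials provider and 3.1+ for the assumed-role provider), which is precisely why hadoop-aws 2.6.5 ignores them; bucket, keys, and ARN are placeholders:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration

// Variation 1: pass the temporary STS credentials directly
// (the S3A connector from hadoop-aws 2.8+ understands fs.s3a.session.token).
hadoopConf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", "<temporary access key>")
hadoopConf.set("fs.s3a.secret.key", "<temporary secret key>")
hadoopConf.set("fs.s3a.session.token", "<session token>")

// Variation 2: let the S3A connector perform the AssumeRole call itself
// (hadoop-aws 3.1+ only).
// hadoopConf.set("fs.s3a.aws.credentials.provider",
//   "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider")
// hadoopConf.set("fs.s3a.assumed.role.arn", "<role ARN>")

val df = spark.read.parquet("s3a://<bucket>/<path to parquet files>")
```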

The simplest path forward would be to bump the Hadoop version without changing the Spark version (changing the Spark version would also force us to change the Scala version, which may require even more work). Otherwise, we should probably think about updating to a more recent version of Spark.
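
For the record, a sketch of what pinning a newer Hadoop client in build.sbt could look like; the versions below are placeholders, and whether the resulting classpath is actually compatible with our Spark version would still need to be verified:

```scala
// build.sbt (sketch): keep the current Spark dependency but force a newer
// Hadoop client onto the classpath. Versions are placeholders, not tested.
dependencyOverrides ++= Seq(
  "org.apache.hadoop" % "hadoop-client" % "3.2.1",
  "org.apache.hadoop" % "hadoop-common" % "3.2.1"
)

// hadoop-aws must match the Hadoop client version.
libraryDependencies += "org.apache.hadoop" % "hadoop-aws" % "3.2.1"
```

Note that this only changes what the migrator is built against; the Hadoop jars shipped with the Spark cluster itself would have to be aligned as well.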

julienrf commented 3 weeks ago

I removed the changes related to Parquet since they require Hadoop 3.x. I think we can merge the PR as it improves the way we authenticate to AWS when migrating from DynamoDB, and we can re-apply the changes related to Parquet after we upgrade to Hadoop 3.x.

tarzanek commented 3 weeks ago

thank you Julien, merging