Closed julienrf closed 3 weeks ago
Adding tests for that would require evolving the testing infrastructure to mock AWS, because we need to call the Security Token Service. I tested it locally with DynamoDB as a source database: the migration failed with an `AccessDeniedException` when we tried to read from the source database. I then set `assumeRole` to refer to a role that I created and that allows performing any action on DynamoDB: the migration worked again.

Since the logic is the same when we connect to a DynamoDB source, a DynamoDB target, and a DynamoDB S3 export as a source, I expect all of them to work as well. However, the logic is a little bit different when using a Parquet source, so that one requires additional tests.
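For illustration, the source section of the migrator configuration could look roughly like this (a hypothetical sketch: the exact property names and nesting depend on how `SourceSettings` models the new option in this PR, and the role ARN is made up):

```yaml
source:
  type: dynamodb
  table: my-table
  credentials:
    accessKey: <access-key>
    secretKey: <secret-key>
    # Hypothetical shape: role assumed via STS before reading the source table
    assumeRole:
      arn: arn:aws:iam::123456789012:role/dynamodb-full-access
```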
I ran a migration scenario from Parquet to Scylla to validate that this PR also works when reading from Parquet files stored on S3, but it failed.
First, note that I had to add `hadoop-aws-2.6.5.jar` to the Spark cluster classpath to make it work, otherwise I was getting a `ClassNotFoundException` when trying to read from `s3a://…` URIs. (But this is unrelated to the PR.)
Then, it seems the authorization delegation does not work. I was able to load the Parquet file when using my own account credentials, but not when using the `assumeRole` option: it failed with a `Forbidden` error when trying to read the S3 object containing the Parquet file.
I tried on the command line (`aws s3 ls …`) and I was able to access the file when using “assume role” as described here, which means that the error is not on the AWS side but on the scylla-migrator side.
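For reference, the standard AWS CLI setup for assuming a role uses a named profile in `~/.aws/config` along these lines (the profile name and account ID are placeholders; the exact profile used in my test is not shown here):

```ini
[profile migrator-test]
role_arn = arn:aws:iam::123456789012:role/dynamodb-full-access
source_profile = default
```

With such a profile, `aws s3 ls --profile migrator-test …` performs the `AssumeRole` call transparently before accessing S3.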
I tried a couple of variations to read Parquet files (e.g. using `org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider` or `org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider`), but nothing worked. The culprit line is this one:
I checked with my debugger: the `hadoopConf` does contain all the authentication information, and yet the returned `FileSystem` does not use the correct credentials and cannot access the resource. It seems we use a very old version of Spark and Hadoop (2.6.5), which did not yet support plugging in a custom credentials provider. According to this commit, that ability was introduced later.
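For reference, on Hadoop 3.1+ the S3A connector supports assumed-role credentials through configuration keys like the following (a sketch based on the Hadoop 3.x S3A documentation; the role ARN is a placeholder, and these keys do not exist in 2.6.5, which is exactly the problem):

```
fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
fs.s3a.assumed.role.arn = arn:aws:iam::123456789012:role/s3-read-only
# Provider used to obtain the base credentials that call AssumeRole
fs.s3a.assumed.role.credentials.provider = org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider
```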
The simplest path forward would be to bump the Hadoop version without changing the Spark version (changing the Spark version would also force us to change the Scala version, which may require even more work). Otherwise, we should probably think about upgrading to a more recent version of Spark.
I removed the changes related to Parquet since they require Hadoop 3.x. I think we can merge the PR as it improves the way we authenticate to AWS when migrating from DynamoDB, and we can re-apply the changes related to Parquet after we upgrade to Hadoop 3.x.
thank you Julien, merging
Adds an optional `assumeRole` property to the credentials configuration objects (`SourceSettings` or `TargetSettings`). Reading Parquet files from S3 (`s3a://…`) is still not supported because our version of Hadoop is too old (see my comment below).

Relates to #149