scylladb / scylla-migrator

Migrate data extracted using Spark to Scylla, normally from Cassandra
Apache License 2.0

add support for assume role for s3 access to parquet files #149

Closed — tarzanek closed this issue 1 week ago

tarzanek commented 1 month ago

We should add an option to assume a role for S3 access, which is the de facto standard these days.

It should be as easy as https://medium.com/@leythg/access-s3-using-pyspark-by-assuming-an-aws-role-9558dbef0b9e (of course rewritten in Scala, with the proper input properties exposed in the config file).
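The approach in that article, translated to Scala, would look roughly like the sketch below: call STS `AssumeRole` explicitly, then hand the temporary credentials to the S3A connector via `TemporaryAWSCredentialsProvider` (which is available on Hadoop 2.x). The role ARN and session name here are placeholders, and the exact wiring into the migrator's config file is an assumption, not the final design.

```scala
import com.amazonaws.services.securitytoken.AWSSecurityTokenServiceClientBuilder
import com.amazonaws.services.securitytoken.model.AssumeRoleRequest
import org.apache.spark.sql.SparkSession

// Hypothetical role ARN; in practice this would come from the migrator's config file.
val roleArn = "arn:aws:iam::123456789012:role/parquet-reader"

// Obtain temporary credentials from STS for the target role.
val sts = AWSSecurityTokenServiceClientBuilder.defaultClient()
val creds = sts.assumeRole(
  new AssumeRoleRequest()
    .withRoleArn(roleArn)
    .withRoleSessionName("scylla-migrator")
).getCredentials

// Feed the temporary credentials to the S3A filesystem connector.
val spark = SparkSession.builder().getOrCreate()
val hadoopConf = spark.sparkContext.hadoopConfiguration
hadoopConf.set("fs.s3a.aws.credentials.provider",
  "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", creds.getAccessKeyId)
hadoopConf.set("fs.s3a.secret.key", creds.getSecretAccessKey)
hadoopConf.set("fs.s3a.session.token", creds.getSessionToken)
```

One caveat with this client-side approach: the STS credentials expire (one hour by default), so a long-running migration would need to refresh them, which is what makes a native credentials provider preferable.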

tarzanek commented 1 month ago

FWIW, we should be able to use the above in other S3 cases too, and eventually for DynamoDB as well.

julienrf commented 3 weeks ago

PR #150 only fixed the issue for DynamoDB, not for Parquet files.

Re-posting my comment here:

I removed the changes related to Parquet since they require Hadoop 3.x. I think we can merge the PR as it improves the way we authenticate to AWS when migrating from DynamoDB, and we can re-apply the changes related to Parquet after we upgrade to Hadoop 3.x.
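For reference, the reason the Parquet changes need Hadoop 3.x is that S3A only gained a built-in assumed-role credentials provider in Hadoop 3.1. Once the upgrade lands, the client-side STS dance above could be replaced by plain configuration along these lines (property names are from the Hadoop 3.x S3A documentation; the role ARN is a placeholder):

```properties
# Let S3A assume the role itself and refresh credentials automatically
# (requires Hadoop 3.1+).
spark.hadoop.fs.s3a.aws.credentials.provider=org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
spark.hadoop.fs.s3a.assumed.role.arn=arn:aws:iam::123456789012:role/parquet-reader
spark.hadoop.fs.s3a.assumed.role.session.name=scylla-migrator
```

Unlike manually injected STS tokens, this provider renews the session credentials transparently, which matters for long migrations.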

Originally posted by @julienrf in https://github.com/scylladb/scylla-migrator/issues/150#issuecomment-2163063798

guy9 commented 3 weeks ago

Thanks @julienrf , please proceed with upgrading the Hadoop and Spark versions.