Closed istreeter closed 7 months ago
Hi, I tested the implementation and I am getting the following error with the CredentialsProvider:
ERROR com.snowplowanalytics.snowplow.lakes.Run - 1 validation error detected: Value null at 'roleArn' failed to satisfy constraint: Member must not be null (Service: Sts, Status Code: 400, Request ID: xxxx-xxxx-xxxx-xxxxx)
When I remove the following line from the application.conf it works:
"fs.s3a.aws.credentials.provider": "com.snowplowanalytics.snowplow.lakes.AssumedRoleCredentialsProviderV1"
Is there anything that needs to be adjusted for making this work with your CredentialsProvider instead of the default one?
Thanks :)
Hi @CapCaWork -- This is a good point. I made AssumedRoleCredentialsProviderV1
the default, but actually it's not appropriate for all use cases. Moreover, I have not yet written documentation for this credentials provider. Before we release this change, I will try to find a better way to set a sensible default.
As a short-term workaround, you could override the default by putting something like this in your configuration hocon file:
spark: {
conf: {
"fs.s3a.aws.credentials.provider": "com.amazonaws.auth.InstanceProfileCredentialsProvider"
}
}
...or pick whichever credentials provider is appropriate for you. Any value in the application.conf
file can always be overridden in your configuration hocon file.
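To illustrate how the override works: the loader's bundled application.conf supplies the default, and any setting repeated in your own hocon file takes precedence. A sketch (the exact key path is taken from the snippet above; your file layout may differ):

```
# Shipped default inside the loader's application.conf:
spark.conf."fs.s3a.aws.credentials.provider" = "com.snowplowanalytics.snowplow.lakes.AssumedRoleCredentialsProviderV1"

# Your configuration hocon file, loaded with higher precedence,
# replaces that value:
spark.conf."fs.s3a.aws.credentials.provider" = "com.amazonaws.auth.InstanceProfileCredentialsProvider"
```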
It's exciting to hear you are an early user of Lake Loader on S3! Please understand this feature is still work-in-progress, and we're still polishing it ready for its final release. Please do share any other feedback!
Hi @istreeter, thanks for the quick feedback. I am aware that this feature is still WIP, but it is super interesting for us, which is why we are already doing a first PoC with it. Otherwise it works great; we're currently running first load tests to check scalability. Nice work, thanks a lot for that!
Hi @CapCaWork, just FYI this is getting addressed in #42. It will be included in the 0.2.0 release, which should happen early January 2024.
For the AWS Lake Loader, we want to assume a role when writing to S3. The Hadoop S3A filesystem supports many authentication methods, of which
org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider
comes closest to what we need. Unfortunately, though, it does not allow us to configure an external id when assuming the role. For best security practice, we want S3 bucket owners to protect their data by requiring the loader to specify a unique external id when assuming the cross-account role. So we need to implement our own credentials provider. It can be a very lightweight wrapper around a delegate credentials provider.
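The wrapper-around-a-delegate shape described above can be sketched roughly like this. Note this uses simplified stand-in traits rather than the real AWS SDK interfaces, and the class and field names are illustrative, not the loader's actual implementation:

```scala
// Simplified stand-ins for the AWS SDK's credentials interfaces,
// used here only to show the wrapper shape.
trait AwsCredentials {
  def accessKeyId: String
  def secretKey: String
}
trait AwsCredentialsProvider {
  def resolveCredentials(): AwsCredentials
}

// Hypothetical delegate that would assume a cross-account role via STS,
// passing along the required external id.
final class StsAssumeRoleProvider(roleArn: String, externalId: String)
    extends AwsCredentialsProvider {
  def resolveCredentials(): AwsCredentials = new AwsCredentials {
    // A real implementation would call sts:AssumeRole here, with
    // RoleArn = roleArn and ExternalId = externalId, and return the
    // temporary credentials from the response.
    val accessKeyId = s"key-for-$roleArn"
    val secretKey   = s"secret-scoped-to-$externalId"
  }
}

// The loader's own provider is then just a thin wrapper that
// forwards every credentials request to the delegate.
final class AssumedRoleCredentialsProvider(delegate: AwsCredentialsProvider)
    extends AwsCredentialsProvider {
  def resolveCredentials(): AwsCredentials = delegate.resolveCredentials()
}
```

The point of the wrapper is that Hadoop's S3A filesystem only needs a class name to instantiate, while the external-id handling stays encapsulated in the delegate.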