snowplow-incubator / snowplow-lake-loader

Snowplow Lake Loader

Custom AWS credentials provider with support for external ID #35

Closed istreeter closed 7 months ago

istreeter commented 9 months ago

For the AWS Lake Loader, we want to assume a role when writing to S3. The hadoop S3a filesystem allows many methods for authenticating, of which the org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider comes closest to being what we need. Unfortunately, though, it does not allow us to configure an external id when assuming the role.

For best security practice, we want S3 bucket owners to protect their data by requiring the loader to specify a unique external id when assuming the cross-account role. So we need to implement our own credentials provider. It can be a very lightweight wrapper around a delegate credentials provider.
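For illustration, a minimal sketch of such a wrapper might look like the following. This is not the project's actual implementation: the class name and the external-ID configuration key are hypothetical, while fs.s3a.assumed.role.arn is a standard S3A setting. It implements the v1 AWS SDK's AWSCredentialsProvider interface and delegates to STSAssumeRoleSessionCredentialsProvider, whose builder supports an external ID:

import java.net.URI

import org.apache.hadoop.conf.Configuration
import com.amazonaws.auth.{AWSCredentials, AWSCredentialsProvider, STSAssumeRoleSessionCredentialsProvider}

// Sketch only: a lightweight wrapper that assumes a role with an external ID
// and delegates all credential handling to the AWS SDK's own provider.
// S3A instantiates providers via a (URI, Configuration) constructor.
class ExternalIdCredentialsProvider(uri: URI, conf: Configuration) extends AWSCredentialsProvider {

  // fs.s3a.assumed.role.arn is a standard S3A key; the external-id key below is hypothetical
  private val roleArn    = conf.get("fs.s3a.assumed.role.arn")
  private val externalId = conf.get("fs.s3a.assumed.role.external.id")

  // The delegate caches the temporary STS credentials and refreshes them before expiry
  private val delegate =
    new STSAssumeRoleSessionCredentialsProvider.Builder(roleArn, "snowplow-lake-loader")
      .withExternalId(externalId)
      .build()

  override def getCredentials: AWSCredentials = delegate.getCredentials
  override def refresh(): Unit = delegate.refresh()
}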

CapCaWork commented 9 months ago

Hi, I tested the implementation, and with the CredentialsProvider I am getting the following error:

ERROR com.snowplowanalytics.snowplow.lakes.Run - 1 validation error detected: Value null at 'roleArn' failed to satisfy constraint: Member must not be null (Service: Sts, Status Code: 400, Request ID: xxxx-xxxx-xxxx-xxxxx

When I remove the following line from the application.conf it works:

"fs.s3a.aws.credentials.provider": "com.snowplowanalytics.snowplow.lakes.AssumedRoleCredentialsProviderV1"

Is there anything that needs to be adjusted for making this work with your CredentialsProvider instead of the default one?

Thanks :)

istreeter commented 9 months ago

Hi @CapCaWork -- This is a good point. I made AssumedRoleCredentialsProviderV1 the default, but it's actually not appropriate for all use cases. Moreover, I have not yet written documentation for this credentials provider. Before we release this change, I will try to find a better way to set a sensible default.

As a short-term workaround, you could override the default by putting something like this in your configuration hocon file:

spark: {
  conf: {
    "fs.s3a.aws.credentials.provider": "com.amazonaws.auth.InstanceProfileCredentialsProvider"
  }
}

...or pick whichever credentials provider is appropriate for you. Any value in the application.conf file can always be overridden in your configuration hocon file.
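For example, if you do want role assumption but don't need an external ID, you could select the stock Hadoop provider mentioned above in the same way (the role ARN below is a placeholder; fs.s3a.assumed.role.arn is the standard S3A setting for it):

spark: {
  conf: {
    "fs.s3a.aws.credentials.provider": "org.apache.hadoop.fs.s3a.auth.AssumedRoleCredentialProvider"
    "fs.s3a.assumed.role.arn": "arn:aws:iam::123456789012:role/my-loader-role"
  }
}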

It's exciting to hear you are an early user of the Lake Loader on S3! Please understand this feature is still a work in progress, and we're still polishing it ahead of its full release. Please do share any other feedback!

CapCaWork commented 9 months ago

Hi @istreeter, thanks for the quick feedback. I am aware that this feature is still WIP, but it is super interesting for us, which is why we are already running a first PoC with it. Apart from that it works great; we are currently running initial load tests to check scalability. Nice work, thanks a lot!

istreeter commented 9 months ago

Hi @CapCaWork just FYI, this is being addressed in #42. It will be included in the 0.2.0 release, which should happen in early January 2024.