nfultz / AWR.Athena

Short R Wrapper for Athena JDBC connections
23 stars 13 forks source link

Using IAM Role with MFA for secure access #21

Open simonsays1980 opened 4 years ago

simonsays1980 commented 4 years ago

I am trying now for hours to access my Athena database (which works charmingly via PyAthena in Python).

My setup looks as follows:

Here is my code:

library( DBI )
library( AWR.Athena )
library( aws.iam )
aws.signature::use_credentials( profile = 'research' )
session = get_session_token( id  = 'arn:aws:iam::<YourAccountNumber>:mfa/<YourIAMUserName>', 
                             code = 258028,
                             role = 'arn:aws:iam::<YourAccountNumber>:role/<YourRoleName>', 
                             duration = 3600 )
Sys.setenv( AWS_SESSION_TOKEN = session$SessionToken )

con <- dbConnect( AWR.Athena::Athena(), 
                  region = 'eu-central-1',
                  S3OutputLocation = 's3://your-output-location/', 
                  Schema = 'default',
                  Workgroup = '<YourWorkGroup>' )
# Query a list of all tables.
dbListTables( con )

I also tried the dbConnect() command with aws_credentials_provider_class = "com.amazonaws.athena.jdbc.shaded.com.amazonaws.auth.EnvironmentVariableCredentialsProvider", but I had no success, either.

I think this is an important topic as secure access to Athena from R will be needed as soon as a production setting is considered.

For reference: In Python I use the following to set up a connection with Role and MFA using a profile:

conn = connect( s3_staging_dir        = 's3://your-output-location/',
                region_name           = 'eu-central-1',
                work_group            = '<YourWorkGroup>',
                profile_name          = '<YourProfileName>'  )

Does anyone has an idea on this? Very appreciated :)

nfultz commented 4 years ago

If you are trying to auth using environment variables, they need to be set before the JVM forks to a seperate process, otherwise they are only set on the R process.

Try loading the Athena library after you do Sys.setenv.

simonsays1980 commented 4 years ago

@nfultz Wow, that was the quickest reply, I every got on an issue :) And thanks for providing this package to us!

I tried your suggestion loading the packages after I call Sys.setenv(), however this somehow did not work.

What I tried during my last comment here was to use the argument use = TRUE in the aws.iam::get_session_token() function. Following the documentation in aws.iam.pdf this stores the session variables and also makes the call to Sys.setenv() redundant:

A logical (default FALSE), specifying whether to use these credentials for subsequent requests. If TRUE, any currently used credentials are stored in a package environment (if no credentials are already stored; in that case, the request will fail) and the requested tokens overwrite them in the relevant environment variables. restore_credentials() can then be used to reset environment variables based on those from the saved environment; delete_saved_credentials() deletes the credentials without restoring them.

My code looks now like:

library( DBI )
library( AWR.Athena )
library( aws.iam )
aws.signature::use_credentials( profile = 'research' )
session = get_session_token( id  = 'arn:aws:iam::<YourAccountNumber>:mfa/<YourIAMUserName>', 
                             code = 258028,
                             role = 'arn:aws:iam::<YourAccountNumber>:role/<YourRoleName>', 
                             duration = 3600,
                             use = TRUE  )

con <- dbConnect( AWR.Athena::Athena(), 
                  region = 'eu-central-1',
                  S3OutputLocation = 's3://your-output-location/', 
                  Schema = 'default',
                  Workgroup = '<YourWorkGroup>' )
# Query a list of all tables.
dbListTables( con )

This runs like a charm so far. I guess I will write a kind of wrapper to avoid putting ARN roles and MFA device IDs into code and instead relying on the ~/.aws/config file. This might be a comfortable way to walk here.

The only error messages I still receive are some errors from log4j:

log4j:ERROR Could not find value for key log4j.appender.stdout
log4j:ERROR Could not instantiate appender named "stdout".

@nfultz I am interested in your opinion on this solution I propose.

nfultz commented 4 years ago

You are brave / lucky that you got aws and AWR packages working together, they have completely different designs.

If you do write an Athena-based wrapper, I would recommend using JVM properties (see discussion in #20, for example) - IIRC they have the highest priority on the credential chain / least chance of surprising behavior. When you come up with something working, I would be open to pulling it into this package.

You may also try using KMS for secrets instead, I've used that before to some success.

The log4j errors are a bit annoying but otherwise benign. If you can post your system java info, we can try to triage it.

simonsays1980 commented 4 years ago

Thanks for your feedback! Wow, I haven't looked deeper into aws nor AWR. I got your point using the JVM properties. What I did not get is the way how to get the AWS_SESSION_TOKEN for the IAM role with the MFA code. As soon as I have this token, it can be put into the JVM properties, I guess.

Simply setting the environment variable AWS_SESSION_TOKEN after receiving it via aws.iam::get_session_token( ..., use = FALSE ) and then loading the AWR.Athena package does not work - I tried it out. Getting this token and keeping it, is the only reason why I use the aws.iam package. One could possibly use .jcall with the AssumeRoleRequest.

My Java Info is:

java version "1.8.0_131"
Java(TM) SE Runtime Environment (build 1.8.0_131-b11)
Java HotSpot(TM) 64-Bit Server VM (build 25.131-b11, mixed mode)
nfultz commented 4 years ago

Can you check the full environment (Sys.getenv()) after you do get_session_toekn(use=TRUE), but before you attach AWR.Athena? My hunch is that the aws package is setting a full set of env vars, not only the session token:

https://github.com/cloudyr/aws.iam/blob/68600b4336aa2d42e5f575bf596964c063f2b31e/R/sts.R#L145-L157

And maybe the token is not compatible with the original access keys.

See also https://docs.aws.amazon.com/AWSJavaSDK/latest/javadoc/com/amazonaws/services/securitytoken/model/GetSessionTokenResult.html#getCredentials--

That said, you could off course continue to use extra profiles like in #20 .

simonsays1980 commented 4 years ago

The aws.iam package does create three environment variables as shown in the linked code above. Next to the AWS_SESSION_TOKEN it further creates the AWS_SECRET_ACCESS_KEY and the AWS_ACCESS_KEY_ID. The latter two variables get created when calling aws.signature::use_credentials( profile = '<YourProfile>' ). The former variable (the session token) gets created when calling aws.iam::get_session_token() by sending the credentials together with the assumed role and the MFA device id together with its corresponding code.

I assume, that the AWR.Athena package then uses the already created session variables. This is a comfortable workaround though. The cleaner form would probably use the Java API to call AssumeRole and to provide therein the credentials, the role to assume, the MFA device id and the corresponding code of this device to get a session token. I have never used the Java API nor the .jcall function in R (I come from the C++ area.