transferwise / pipelinewise-target-redshift

Singer.io Target for Amazon Redshift - PipelineWise compatible
https://transferwise.github.io/pipelinewise/
Other
12 stars 65 forks source link

Avoid clobbering credentials when not using tokens #38

Closed paul-english closed 4 years ago

paul-english commented 4 years ago

Should be a fix for https://github.com/transferwise/pipelinewise-target-redshift/issues/36

paul-english commented 4 years ago

@koszti Do you know if that test will passes if aws_redshift_copy_role_arn is provided in the config? If so, the statement credentials = aws_session.get_credentials().get_frozen_credentials() might still be needed when we aren't using an arn for the copy sql. I apologize I don't know enough about boto3, but I think calling that is what leads to #36. @Limess does this seem correct?

Limess commented 4 years ago

Looks like there might be a weird race condition in terms of calling aws_session.get_credentials().get_frozen_credentials() prior to initialising an S3 client then?

Maybe we should conditionally call aws_session.get_credentials().get_frozen_credentials() immediately prior to the COPY call (boto3 caches credentials while they're valid so this should be fine and automatically refreshes them when calling get_credentials() if necessary) if no role is provided, and also use the session token from credentials in the COPY command if it exists.

The credentials will be obtained from AWS STS via an assume role call so will be temporary and require the token (token in the returned credentials object: https://www.pydoc.io/pypi/botocore-1.7.0/autoapi/credentials/index.html#credentials.Credentials).

We should probably persist the aws_session on the class rather than the configuration used to create it.

koszti commented 4 years ago

@log0ymxm , @Limess with the change below I could make every test passing and seems it's working as expected. The change doesn't let boto3 to get the supported environment variables by their own but it passes the aws_access_key_id, aws_secret_access_key and aws_session_token from config.json or from environment variables.

Every test case passes and I manually tested the following scenarios:

(Temp STS credentials requested by CLI: aws sts get-session-token --duration-seconds 900)

Can you please let me know if this works for you. If yes, then please update your PR and I'm happy to merge and sync it to pipelinewise-redshift-fastsync.

Changes in db_sync.py

        aws_access_key_id = self.connection_config.get('aws_access_key_id') or os.environ.get('AWS_ACCESS_KEY_ID')
        aws_secret_access_key = self.connection_config.get('aws_secret_access_key') or os.environ.get('AWS_SECRET_ACCESS_KEY')
        aws_session_token = self.connection_config.get('aws_session_token') or os.environ.get('AWS_SESSION_TOKEN')

        # Init S3 client
        # Conditionally pass keys as this seems to affect whether instance credentials are correctly loaded if the keys are None
        if aws_access_key_id and aws_secret_access_key:
            aws_session = boto3.session.Session(
                aws_access_key_id=aws_access_key_id,
                aws_secret_access_key=aws_secret_access_key,
                aws_session_token=aws_session_token
            )
            credentials = aws_session.get_credentials().get_frozen_credentials()

            # Explicitly set credentials to those fetched from Boto so we can re-use them in COPY SQL if necessary
            self.connection_config['aws_access_key_id'] = credentials.access_key
            self.connection_config['aws_secret_access_key'] = credentials.secret_key
            self.connection_config['aws_session_token'] = credentials.token
        else:
            aws_session = boto3.session.Session()
Limess commented 4 years ago

I'd still look to move the get_credentials() call to immediately before the Redshift COPY command as it will trigger a refresh if boto3 is getting them from instance/ECS credentials

paul-english commented 4 years ago

@koszti the change looks to still work for me, I've made those updates in my branch. @Limess I haven't made your suggested change.

paul-english commented 4 years ago

@koszti is there any update on this PR needed, do we need the updates @Limess is suggesting?

koszti commented 4 years ago

this is merged and released to pypi as pipelinewise-target-redshift 1.2.1. Further enhancements can be released by separated PRs.

@log0ymxm I synced these changes to pipelinewise-redshift-fastsync (here) but please note we can't merge new PRs to the main pipelinewise repo until 6th Jan so still need to wait a few more days but it will arrive in the very beginning of the new year.