snowplow / dataflow-runner

Run templatable playbooks of Hadoop/Spark/et al jobs on Amazon EMR
http://snowplowanalytics.com

Add support of default AWS credentials chain #36

Closed chuwy closed 4 years ago

chuwy commented 6 years ago

Right now we can use:

"credentials": {
  "accessKeyId": "env",
  "secretAccessKey": "env"
}

to extract credentials from environment variables, but we could also support the more standard approach of relying on the default AWS credentials chain (default profile, instance role, etc.).
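
For reference, a minimal sketch of what the default chain gives us, assuming the aws-sdk-go library that dataflow-runner already uses (the region below is just a placeholder, not something the runner requires here):

// Sketch only: relying on the SDK's default credential provider chain.
// A session created without explicit credentials resolves them from
// environment variables, then the shared ~/.aws/credentials profile,
// then an attached IAM role (ECS task role or EC2 instance profile).
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// No Credentials set in the config, so the session falls back to the
	// default chain: env vars -> shared credentials file -> IAM role.
	sess, err := session.NewSession(&aws.Config{Region: aws.String("us-east-1")}) // region is a placeholder
	if err != nil {
		log.Fatal(err)
	}
	v, err := sess.Config.Credentials.Get()
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("credentials resolved by provider:", v.ProviderName)
}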

davehowell commented 4 years ago

Use on ECS Fargate is a good reason to support this: https://andrew.hawker.io/writings/2020/02/10/snowplow-fargate-task-role/

davehowell commented 4 years ago

I've tried setting this up on Fargate. I notice that the code already supports using iam instead of env in GetCredentialsProvider().

I tried using iam but it results in this error:

error: invalid memory address or nil pointer dereference

full error message:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x89b9d6]
goroutine 22 [running]:
github.com/aws/aws-sdk-go/aws/ec2metadata.(*EC2Metadata).GetMetadata(0x0, 0xc47452, 0x19, 0x0, 0x0, 0x0, 0x0)
    /home/travis/gopath/src/github.com/aws/aws-sdk-go/aws/ec2metadata/api.go:60 +0x146
github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds.requestCredList(0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
    /home/travis/gopath/src/github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds/ec2_role_provider.go:134 +0x58
github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds.(*EC2RoleProvider).Retrieve(0xc0002e4510, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0, ...)
    /home/travis/gopath/src/github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds/ec2_role_provider.go:90 +0x6a
github.com/aws/aws-sdk-go/aws/credentials.(*Credentials).singleRetrieve(0xc0002e4540, 0x0, 0x0, 0x0, 0x0)
    /home/travis/gopath/src/github.com/aws/aws-sdk-go/aws/credentials/credentials.go:251 +0x29f
github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).doCall(0xc0002e4550, 0xc0002ec7e0, 0x0, 0x0, 0xc0002d5020)
    /home/travis/gopath/src/github.com/aws/aws-sdk-go/internal/sync/singleflight/singleflight.go:97 +0x2e
created by github.com/aws/aws-sdk-go/internal/sync/singleflight.(*Group).DoChan
    /home/travis/gopath/src/github.com/aws/aws-sdk-go/internal/sync/singleflight/singleflight.go:90 +0x2b4

There is some useful information in this AWS forum reply:

the instance metadata endpoint is not available for Fargate. Instead, you may make use of Task Metadata Endpoint... if you wish to make use of Task Role credentials, you may make use of the AWS_CONTAINER_CREDENTIALS_RELATIVE_URI environment variable ... the credentials URI is passed to PID 1 of the container only. So, if you are running your application as a child process, the application does not have access to the Task Role relative URI variable ...

As a workaround, you can export AWS_CONTAINER_CREDENTIALS_RELATIVE_URI in your wrapper / init.d / docker-entrypoint script.

At this stage I have forked the repo and am looking to see whether I can add support for the default credentials chain, because all of the AWS documentation I have read suggests it should just work with a new session: a session created without explicit credentials defaults to the credentials provider chain, which in the context of ECS/Fargate should discover the task metadata endpoint. I have a suspicion that it is the explicit use of the ec2_role_provider that is not working with Fargate.
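
To make that suspicion concrete, here is a sketch contrasting the two approaches. This is not the actual dataflow-runner code, which I have only inferred from the stack trace above; it just illustrates why an explicitly built EC2 role provider cannot work on Fargate while the default chain should:

// Illustration only; the exact construction inside dataflow-runner may differ.
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/credentials/ec2rolecreds"
	"github.com/aws/aws-sdk-go/aws/session"
)

func main() {
	// Roughly what the "iam" case appears to build, judging by the stack trace:
	// an EC2RoleProvider with no Client set. Calling Get() on this dereferences
	// a nil *ec2metadata.EC2Metadata and panics, and even with a Client it can
	// only reach the EC2 instance metadata endpoint, which Fargate does not have.
	_ = credentials.NewCredentials(&ec2rolecreds.EC2RoleProvider{})

	// The default chain covers env vars, the shared credentials file, the ECS
	// task role (via AWS_CONTAINER_CREDENTIALS_RELATIVE_URI) and the EC2
	// instance role, so it should work on Fargate without special casing.
	sess := session.Must(session.NewSession())
	if v, err := sess.Config.Credentials.Get(); err == nil {
		fmt.Println("default chain resolved via:", v.ProviderName)
	}
}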

If the default works, then my feeling is that the GetCredentialsProvider function is redundant. If you provide environment variables to an EC2 instance or an ECS/Fargate task, those variables take priority in the credentials provider chain, and if they are absent the chain falls back to discovering IAM roles, so specifying env or iam would be a tautology. The other use case, explicitly providing a hardcoded (or even templated) AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY, is probably not ideal from a security point of view, so it shouldn't be provisioned for in the code; there are plenty of other ways to inject environment variables into EC2, ECS, or Lambda, with no need to support it via a custom config. There is also another environment variable that tells the credentials chain to use the shared credentials profile in ~/.aws/credentials, but that also works with the default chain, so it doesn't need to be handled explicitly.

On the other hand, these properties in the cluster and playbook config files are part of the Iglu schemas, so removing them would imply a new schema version, which may not be worth it. If that's the case, it's easy to just add another case to the GetCredentialsProvider function for default.
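
For example, a default case could be as small as the sketch below. The names and signature are my own sketch rather than the current dataflow-runner code or whatever the eventual PR ends up doing; defaults.CredChain is the aws-sdk-go helper that builds the standard provider chain, and only the cases relevant to this discussion are shown.

package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws/credentials"
	"github.com/aws/aws-sdk-go/aws/defaults"
)

// getCredentialsProvider maps the accessKeyId/secretAccessKey config values
// to an SDK credentials object (sketch only).
func getCredentialsProvider(accessKeyID, secretAccessKey string) (*credentials.Credentials, error) {
	switch {
	case accessKeyID == "env" && secretAccessKey == "env":
		// Existing behaviour: read AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY.
		return credentials.NewEnvCredentials(), nil
	case accessKeyID == "default" && secretAccessKey == "default":
		// New case: hand resolution to the SDK's standard chain (env vars,
		// shared config, ECS task role, EC2 instance role).
		return defaults.CredChain(defaults.Config(), defaults.Handlers()), nil
	default:
		// Literal keys supplied in the config.
		return credentials.NewStaticCredentials(accessKeyID, secretAccessKey, ""), nil
	}
}

func main() {
	creds, err := getCredentialsProvider("default", "default")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("credentials chain configured:", creds != nil)
}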

I am happy to put together a PR for this. I've already set up the dev environment and found a few issues in the Vagrant build, as mentioned in https://github.com/snowplow/dataflow-runner/issues/62, which I would also include. @chuwy, I would appreciate your feedback and guidance on this. For now I am making some changes and testing anyway, but I'd prefer to get a PR merged so I don't have to run a custom fork.

davehowell commented 4 years ago

I've made the changes and added tests, which pass. I will test the executable in my environment and then submit a PR.

I note that two unrelated tests failed before I made any changes; I know how to fix them. Both test HashiCorp Consul related features:

  1. TestConsulLock
  2. TestGetLock

Both errors are happening in the makeClient function in lock_test.go on line 33.

FAIL: TestGetLock (2.02s)
    server.go:278: CONFIG JSON: {"node_name":"node-796518e7-0a25-b7b4-127b-bfb672a0efe3","node_id":"796518e7-0a25-b7b4-127b-bfb672a0efe3","performance":{"raft_multiplier":1},"bootstrap":true,"server":true,"data_dir":"/tmp/TestGetLock691941124/data","segments":null,"disable_update_check":true,"log_level":"debug","bind_addr":"127.0.0.1","addresses":{},"ports":{"dns":30427,"http":30428,"https":30429,"serf_lan":30430,"serf_wan":30431,"server":30432},"acl_enforce_version_8":false,"acl":{"tokens":{}},"connect":{"ca_config":{"cluster_id":"11111111-2222-3333-4444-555555555555"},"enabled":true}}
    io.go:404: ==> Error decoding '/tmp/TestGetLock691941124/config.json': Config has invalid keys: segments,connect,acl
    io.go:404:
    lock_test.go:33: api unavailable

I manually installed the latest version of Consul (1.7.2) into the Vagrant box, ran make test again, and all of the tests pass.

PASS
coverage: 72.6% of statements 

The fix for this will be:

  1. Add a new playbook to the ansible-playbooks repo for that Consul version; I submitted a PR for that: https://github.com/snowplow/ansible-playbooks/pull/135
  2. When that is merged, edit up.playbooks in this repo to refer to 1.7.2.

That is not a long-term fix, however. I think this project should probably adopt Go modules (go mod init) and generate the dependency files.
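
A minimal go.mod along those lines might look like the following; the module path is assumed from the repository name, the Go version is a placeholder, and the require block would be generated by go mod tidy rather than written by hand:

module github.com/snowplow/dataflow-runner

go 1.13 // placeholder; use whatever Go version the build targets

// `go mod tidy` would populate the require block from the existing imports
// (aws-sdk-go, consul/api, etc.).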

davehowell commented 4 years ago

PR added: https://github.com/snowplow/dataflow-runner/pull/63

davehowell commented 4 years ago

@chuwy It would be awesome if someone at Snowplow could review the PR. We've been running a fork with that change since April, on ECS/Fargate, and would prefer to run the official version.