Open smugryan opened 9 years ago
This is a cool idea - pull request most welcome!
:+1:
Sister ticket: https://github.com/snowplow/sluice/issues/31
Why a separate and blank ticket? (Just curious if I could be filing tickets in a better manner)
I'm pretty sure we will need to update Sluice to support IAM roles too...
:+1:
This ticket is quite old, and AWS security best practice is not to hard-code secret keys but to use IAM roles with instance profiles or other kinds of temporary tokens. I tried to track this problem down, and one thing I found is the missing support for this in Sluice, as already mentioned. Any update on this? Is it on a roadmap? It's quite an important feature for us to get security right. Combined with the documented IAM permissions, which do not conform to the principle of least privilege, it is even more of a problem: https://discourse.snowplowanalytics.com/t/what-is-the-minimum-viable-iam-policy-for-snowplow-operation/192
We're planning on removing the Sluice dependency and using fog-aws directly, which supports IAM. I think there was a ticket dedicated to the move, but I can't seem to find it.
However, AFAIK Elasticity, the Ruby wrapper around the EMR API that we use inside EmrEtlRunner, doesn't provide a way to use IAM yet, so there might still be a bit of work there in the medium term.
In the longer term, we're planning on moving to dataflow-runner, which supports IAM directly.
This is for the batch pipeline; the real-time pipeline already supports IAM roles.
Awesome. Many thanks for the fast reply. That already helps us to plan ahead :)
At this point, any dev cycles we could put into improving EmrEtlRunner, we could instead put into the migration to Dataflow Runner, so most likely this won't happen. But added: https://github.com/snowplow/dataflow-runner/issues/34
It would be great not to have to put IAM user keys/secrets into configuration files, and instead have the EMR/ETL and storage loader tools pull credentials from the IAM role of the EC2 instance they are running on.
This way we don't have to create a new user under our AWS account and instead assign the role to the machine that is running the EMR/ETL/storage loader jobs. Better for security and ease of setup.
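For illustration, here is a minimal sketch of the fallback requested above: use keys from the configuration if present, otherwise ask the EC2 instance metadata service for the temporary credentials of the role attached to the instance. It is written in Python rather than the tools' own Ruby, and the function names and config keys (`resolve_credentials`, `access_key_id`, `secret_access_key`) are hypothetical, not taken from EmrEtlRunner, StorageLoader, or Sluice. It also assumes the older IMDSv1 request style; instances that enforce IMDSv2 would additionally need a session token header.

```python
import json
from urllib.request import urlopen

# Instance metadata path that lists the role name and serves its credentials.
METADATA_BASE = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"


def http_get(url, timeout=2):
    """Plain GET against the metadata service (IMDSv1 style, an assumption)."""
    with urlopen(url, timeout=timeout) as resp:
        return resp.read().decode("utf-8")


def resolve_credentials(config, fetch=http_get):
    """Return credentials from config if set, else from the instance role.

    `config` is a hypothetical dict standing in for the tool's config file.
    `fetch` is injectable so the lookup can be exercised off EC2.
    """
    key = config.get("access_key_id")
    secret = config.get("secret_access_key")
    if key and secret:
        # Static keys were configured explicitly: keep the existing behaviour.
        return {"access_key_id": key, "secret_access_key": secret, "session_token": None}

    # The first request returns the name of the role attached to the instance.
    role = fetch(METADATA_BASE).strip().splitlines()[0]
    # The second request returns a JSON document with temporary credentials.
    doc = json.loads(fetch(METADATA_BASE + role))
    return {
        "access_key_id": doc["AccessKeyId"],
        "secret_access_key": doc["SecretAccessKey"],
        "session_token": doc["Token"],
    }
```

Role credentials are temporary, so a long-running job would need to call this again before they expire. In practice an AWS SDK's built-in instance profile support would be preferable to hand-rolled requests like these.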