Closed datawookie closed 5 years ago
An update on the above: after doing still more reading I tried using EMR_EC2_DefaultRole for the profile.
Now I get a little bit further:
Launching 2 instances...
An error occurred (UnauthorizedOperation) when calling the RunInstances operation: You are not authorized to perform this operation.
A further update: if I use the credentials for the administrator user then this works fine!
So I guess that my outstanding issue is: what policy needs to be applied to the datawookie user so that it can also be used to launch a cluster? AmazonEC2FullAccess and AmazonS3FullAccess do not appear to be sufficient. But AdministratorAccess is complete overkill.
There are two sets of permissions we need to consider separately:
1. What Flintrock itself can do on AWS.
2. What the launched Flintrock cluster can do.
If you are running Flintrock from your workstation (e.g. laptop), then 1. is about what IAM user credentials you are using from that workstation. That controls what Flintrock can do on AWS. On the other hand, 2. is strictly about IAM roles, not users. That controls what the launched Flintrock cluster can do. I think you are mixing the two concepts and getting confused.
When Flintrock launches a cluster, it needs permission to create instances, security groups, and query IAM. I don't have a comprehensive list of the exact permissions, unfortunately, but it should be clear from the errors you're getting which one you're missing. You definitely don't need full administrator access.
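As a rough illustration of reading the missing permission off an error, here is a minimal Python sketch that pulls the error code and API operation out of a message like the RunInstances one earlier in this thread. The function names are hypothetical, and the mapping from operation name to IAM action (e.g. RunInstances to ec2:RunInstances) is an assumption that holds for many EC2 calls but not necessarily all of them.

```python
import re

# Message shape seen earlier in this thread:
# "An error occurred (UnauthorizedOperation) when calling the RunInstances operation: ..."
_CLIENT_ERROR = re.compile(
    r"An error occurred \((?P<code>[^)]+)\) when calling the "
    r"(?P<operation>\w+) operation"
)


def parse_client_error(message):
    """Return (error_code, operation) from an AWS client error message, or None."""
    match = _CLIENT_ERROR.search(message)
    if match is None:
        return None
    return match.group("code"), match.group("operation")


def guess_missing_action(message, service="ec2"):
    """Guess the IAM action to grant.

    Assumption: the IAM action is '<service>:<operation>'. Check the result
    against the AWS documentation before adding it to a policy.
    """
    parsed = parse_client_error(message)
    if parsed is None:
        return None
    _, operation = parsed
    return f"{service}:{operation}"
```

Feeding it the error quoted above would suggest ec2:RunInstances as the first permission to look at.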
From the looks of it, you want to use the datawookie IAM user to run Flintrock. In other words, when you check the active AWS credentials on the workstation you're running Flintrock from (there are several ways the credentials might be configured), they are for that user. That user seems to be missing some permissions. Please share the full set of policies attached to datawookie and I can help you debug it.
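To make "several ways the credentials might be configured" a bit more concrete, here is a simplified Python sketch that checks the two most common sources: environment variables, then a profile in the shared credentials file. It is only an approximation of the lookup order the AWS SDKs use (it ignores ~/.aws/config, SSO, and instance roles), and the function name is hypothetical.

```python
import configparser
import os
from pathlib import Path


def describe_credential_source(env=None, credentials_file=None):
    """Report where AWS credentials would most likely be picked up from.

    Simplified sketch: checks environment variables first, then the
    shared credentials file for the active profile.
    """
    env = os.environ if env is None else env

    # Explicit keys in the environment take precedence over profile files.
    if env.get("AWS_ACCESS_KEY_ID") and env.get("AWS_SECRET_ACCESS_KEY"):
        return "environment variables"

    # AWS_PROFILE selects the profile; "default" is used when it is unset.
    profile = env.get("AWS_PROFILE", "default")
    path = Path(
        credentials_file
        or env.get("AWS_SHARED_CREDENTIALS_FILE")
        or Path.home() / ".aws" / "credentials"
    )

    parser = configparser.ConfigParser()
    parser.read(path)  # a missing file is silently skipped
    if parser.has_section(profile) and parser.has_option(profile, "aws_access_key_id"):
        return f"profile '{profile}' in {path}"

    return "not found by this sketch (may come from ~/.aws/config, SSO, or an instance role)"


if __name__ == "__main__":
    print(describe_credential_source())
```

If the AWS CLI is installed, running `aws sts get-caller-identity` is a more direct way to confirm which user the active credentials belong to.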
The IAM role you attach to the cluster via instance-profile-name, on the other hand, is just about what you want to enable the cluster to do once it's launched and ready. This IAM role is typically how people configure S3 access for their Flintrock cluster, so that Spark can access data on S3 without needing to be passed explicit credentials. This has nothing to do with Flintrock itself. Note how in your first message you tried to use the datawookie IAM user as an IAM role to be attached to the cluster, which doesn't make sense.
I suggest you try launching a Flintrock cluster without anything configured for instance-profile-name, to get that out of the way for now, and focus first on getting the permissions for datawookie right.
Does that clarify things?
Hi Nicholas,
Thanks for your detailed and helpful response. You were right: conflating the user and role ideas was part of my problem. Following your suggestion I have just focused on figuring out what policies I need to have in place in order to launch the cluster. This is what I now have for user datawookie (and it works!):
AmazonEC2FullAccess
IAMFullAccess
AmazonS3FullAccess
These seem rather broad and I'm sure that it would be good to clamp down to more specific policies, but I am happy to have this just working!
Regarding the IAM role: is EMR_EC2_DefaultRole the only/best option? I'm still trying to wrap my head around just exactly what a role does.
Thanks for your help! And +1 for flintrock: it really does make the process of spinning up a cluster relatively simple!
The IAM role (a.k.a. instance-profile-name) is for granting your launched cluster permission to do things. For most people, this will include permission to read from and write to certain paths on S3, but in theory it can include anything you expect the cluster to do with Amazon services, like access AWS Lambda or what have you. If you're not accessing data on S3 from your cluster, you may not need to attach an IAM role to it at all.
If the datawookie IAM user credentials are the credentials you're running Flintrock (the tool) under, then the permissions that user needs are much narrower than what you listed:
AmazonEC2FullAccess role. I could list out the individual EC2 permissions that would be used, but the list is long.
datawookie doesn't need any access to S3 at all. Flintrock doesn't use S3 while launching or managing clusters.
IAM.GetInstanceProfile. Flintrock definitely does not need, and should not get, full IAM access or anything like that.
Aha! Okay, the roles are starting to make more sense. I do need to access S3 and if there is no compelling alternative to EMR_EC2_DefaultRole (or reason not to use it!) then it certainly does the job and I'm happy with that!
With regard to the permissions for the user used to launch Flintrock:
AmazonEC2FullAccess
AmazonS3FullAccess (no ill effects!)
IAMFullAccess with IAMUserSSHKeys.
This setup worked. I could not find a more granular policy than IAMUserSSHKeys that would still allow Flintrock to do everything that it needs.
Thanks again for your help with this!
You are using the prepackaged IAM roles and policies that Amazon offers, but you can always create your own with a very fine-grained set of permissions. You could, for example, add an inline policy to datawookie that granted access to just IAM.GetInstanceProfile, instead of using IAMUserSSHKeys. You can do the same for the role you attach to your Flintrock cluster: i.e. specifying limited S3 permissions instead of using the broad EMR_EC2_DefaultRole.
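For the cluster role, "limited S3 permissions" could look something like the policy document built by this Python sketch. The bucket name and prefix are hypothetical placeholders, and the choice of actions is an assumption about a typical read/write workload rather than a list taken from Flintrock or Spark documentation.

```python
import json


def s3_prefix_policy(bucket, prefix=""):
    """Build an IAM policy document limited to one bucket and key prefix.

    Assumption: the cluster only needs to list the bucket and read/write
    objects under the prefix. Jobs that write through Spark may also need
    s3:DeleteObject, depending on how output is committed.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself.
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket}"],
            },
            {
                # Object reads and writes are granted on keys under the prefix.
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{bucket}/{prefix}*"],
            },
        ],
    }


if __name__ == "__main__":
    # "my-example-bucket" and "data/" are placeholders, not names from this thread.
    print(json.dumps(s3_prefix_policy("my-example-bucket", "data/"), indent=2))
```

The printed JSON can be pasted into a custom policy attached to whatever role sits behind instance-profile-name.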
In any case, glad you have things working now, and glad I could help.
Update on this thread: it turns out that IAM.GetInstanceProfile is not sufficient. You also need to have IAM.PassRole.
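As a sketch of what that inline policy might look like, the Python below builds a policy document granting only those two IAM permissions (written iam:GetInstanceProfile and iam:PassRole in policy JSON). The account ID is a placeholder, the function name is hypothetical, and the sketch assumes the instance profile shares its name with the role unless told otherwise.

```python
import json


def flintrock_iam_policy(account_id, role_name, profile_name=None):
    """Build an inline policy with the two IAM permissions named in this thread.

    Assumption: the instance profile has the same name as the role unless
    profile_name is given.
    """
    profile_name = profile_name or role_name
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Lets the user look up the instance profile being attached.
                "Effect": "Allow",
                "Action": "iam:GetInstanceProfile",
                "Resource": f"arn:aws:iam::{account_id}:instance-profile/{profile_name}",
            },
            {
                # Lets the user hand the role to the instances it launches.
                "Effect": "Allow",
                "Action": "iam:PassRole",
                "Resource": f"arn:aws:iam::{account_id}:role/{role_name}",
            },
        ],
    }


if __name__ == "__main__":
    # 123456789012 is a placeholder account ID.
    print(json.dumps(flintrock_iam_policy("123456789012", "EMR_EC2_DefaultRole"), indent=2))
```

Scoping both statements to a single role keeps the launching user from passing arbitrary roles to new instances.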
I've written a short blog post about setting up the required (minimal) IAM permissions.
Nicely done. Thanks for writing this up and sharing it @DataWookie.
Hi,
I'm having trouble following the instructions in the README for accessing S3. Actually battling to get past the first step (launching a cluster with a specific IAM role).
I have the following two IAM users:
administrator (AdministratorAccess policy)
datawookie (AmazonEC2FullAccess and AmazonS3FullAccess policies)
I want to launch my cluster using the latter identity. This is the relevant portion of my config.yaml:
When I attempt to launch a cluster I get the following error:
Note that this seems to mention both the datawookie and administrator profiles.
I then added the GetInstanceProfile permissions to the datawookie user as an inline policy. Now launching the cluster gives this error:
I'm sure that there is a simple explanation for this. Do you have any ideas?
I'm running flintrock 0.10.0 on Ubuntu 18.04.
Thanks, Andrew.