ray-project / ray

Ray is an AI compute engine. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

[autoscaler] Ray Cluster Launcher on AWS | Minimizing Permissions #9327

Open VishDev12 opened 4 years ago

VishDev12 commented 4 years ago

This (non) issue takes a brief look at how we can minimize the permissions granted to the Ray Cluster Launcher when using it with AWS.

The cluster launcher works by launching a single head node and using that node to launch the cluster’s worker nodes. If you’re using the launcher with AWS for the first time, an Instance Profile is auto-created and a role with full EC2 and S3 permissions is attached to it; this role also has the sts:AssumeRole permission.

This works seamlessly for basic use-cases, but if you need to grant AWS permissions to the worker nodes – to allow them to access S3, for example – you’re going to need to make a few changes. While we’re doing that, let’s also trim down the EC2 and S3 permissions granted to the head node.

Example Use Case

Let’s say we need a setup that has the following properties:

Breakdown

Steps

1. Create an IAM role to assign to the head node

Role name: ray-head-v1

If you create this role for EC2 on the AWS console, an instance profile will be automatically created.

If you create this role using the AWS CLI, then create an instance profile of the same name and assign the role to it as below.

aws iam create-instance-profile --instance-profile-name ray-head-v1
aws iam add-role-to-instance-profile --instance-profile-name ray-head-v1 --role-name ray-head-v1

The AWS console page for this role also lists the ARN of the instance profile. Alternatively, look it up with the CLI:

aws iam list-instance-profiles | grep ray-head-v1

2. Create an IAM role to assign to the worker node

Role name: ray-worker-v1

Follow the same procedure as the previous step.

3. Create an IAM policy that will allow EC2 instance launches

Policy name: ray-ec2-launcher

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "ec2:RunInstances",
            "Resource": "arn:aws:ec2:us-west-1::image/ami-*"
        },
        {
            "Effect": "Allow",
            "Action": "ec2:RunInstances",
            "Resource": [
                "arn:aws:ec2:us-west-1:<aws-account-number>:instance/*",
                "arn:aws:ec2:us-west-1:<aws-account-number>:network-interface/*",
                "arn:aws:ec2:us-west-1:<aws-account-number>:subnet/*",
                "arn:aws:ec2:us-west-1:<aws-account-number>:key-pair/*",
                "arn:aws:ec2:us-west-1:<aws-account-number>:volume/*",
                "arn:aws:ec2:us-west-1:<aws-account-number>:security-group/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:TerminateInstances",
                "ec2:DeleteTags",
                "ec2:StartInstances",
                "ec2:CreateTags",
                "ec2:StopInstances"
            ],
            "Resource": "arn:aws:ec2:us-west-1:<aws-account-number>:instance/*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:Describe*"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
              "arn:aws:iam::<aws-account-number>:instance-profile/ray-head-v1",
              "arn:aws:iam::<aws-account-number>:instance-profile/ray-worker-v1"
            ]
        }
    ]
}
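If you'd rather not hand-edit the account number and region, the policy above can be generated. This is only a minimal sketch: `build_launcher_policy` is a hypothetical helper, not part of Ray or the AWS SDK, and `123456789012` is a placeholder account number. Note one deliberate difference from the JSON above: the `iam:PassRole` resources here are role ARNs (`role/...`) rather than instance-profile ARNs, following the correction reported by a commenter further down this thread.

```python
import json


def build_launcher_policy(account_id: str, region: str = "us-west-1") -> dict:
    """Build the ray-ec2-launcher policy document as a plain dict (hypothetical helper)."""
    ec2 = f"arn:aws:ec2:{region}:{account_id}"
    # Resource types that ec2:RunInstances touches inside the account.
    run_kinds = ["instance", "network-interface", "subnet",
                 "key-pair", "volume", "security-group"]
    return {
        "Version": "2012-10-17",
        "Statement": [
            # AMI ARNs carry no account number.
            {"Effect": "Allow", "Action": "ec2:RunInstances",
             "Resource": f"arn:aws:ec2:{region}::image/ami-*"},
            {"Effect": "Allow", "Action": "ec2:RunInstances",
             "Resource": [f"{ec2}:{kind}/*" for kind in run_kinds]},
            {"Effect": "Allow",
             "Action": ["ec2:TerminateInstances", "ec2:DeleteTags",
                        "ec2:StartInstances", "ec2:CreateTags",
                        "ec2:StopInstances"],
             "Resource": f"{ec2}:instance/*"},
            {"Effect": "Allow", "Action": ["ec2:Describe*"], "Resource": "*"},
            # Assumption: PassRole is scoped to role ARNs, per the later comment.
            {"Effect": "Allow", "Action": ["iam:PassRole"],
             "Resource": [f"arn:aws:iam::{account_id}:role/{name}"
                          for name in ("ray-head-v1", "ray-worker-v1")]},
        ],
    }


if __name__ == "__main__":
    # Placeholder account number; substitute your own.
    print(json.dumps(build_launcher_policy("123456789012"), indent=4))
```

The printed JSON can be saved to a file and used when creating the policy in the console or with the CLI.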

4. Create a policy to access the S3 bucket

Policy name: ray-s3-access

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "s3:*"
            ],
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::ray-data/*",
                "arn:aws:s3:::ray-data"
            ]
        }
    ]
}
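Policy documents have to be strictly valid JSON, so something as small as a stray trailing comma should be expected to get the document rejected. One way to avoid that class of mistake is to generate the document and check that it parses before uploading it. A minimal sketch, assuming the bucket is named `ray-data` as above; `build_s3_policy` and `is_valid_json` are hypothetical helper names:

```python
import json


def build_s3_policy(bucket: str = "ray-data") -> str:
    """Return the ray-s3-access policy as a JSON string (hypothetical helper)."""
    policy = {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Action": ["s3:*"],
                "Effect": "Allow",
                # Both the objects in the bucket and the bucket itself.
                "Resource": [
                    f"arn:aws:s3:::{bucket}/*",
                    f"arn:aws:s3:::{bucket}",
                ],
            }
        ],
    }
    return json.dumps(policy, indent=4)


def is_valid_json(text: str) -> bool:
    """Return True if the text parses as strict JSON."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


if __name__ == "__main__":
    document = build_s3_policy()
    print(document)
    print("valid JSON:", is_valid_json(document))
```

Because the document is serialized by `json.dumps`, it cannot contain trailing commas, and `is_valid_json` can also be pointed at a hand-written policy file's contents.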

5. Assign both of the above policies to the ray-head-v1 role

You can do this either through the AWS console interactively or using the CLI with:

aws iam attach-role-policy --policy-arn arn:aws:iam::<aws-account-number>:policy/ray-ec2-launcher --role-name ray-head-v1
aws iam attach-role-policy --policy-arn arn:aws:iam::<aws-account-number>:policy/ray-s3-access --role-name ray-head-v1

6. Assign the S3 access policy to the ray-worker-v1 role
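This works the same way as the previous step, just with the worker role. As a rough sketch, the attachments from steps 5 and 6 can be written down as a mapping and turned into the corresponding `aws iam attach-role-policy` commands. `ROLE_POLICIES` and `attach_commands` are hypothetical names, and `123456789012` is a placeholder account number:

```python
import shlex

# Which managed policies each role should receive (steps 5 and 6).
ROLE_POLICIES = {
    "ray-head-v1": ["ray-ec2-launcher", "ray-s3-access"],
    "ray-worker-v1": ["ray-s3-access"],
}


def attach_commands(account_id: str) -> list[list[str]]:
    """Build one `aws iam attach-role-policy` argument list per (role, policy) pair."""
    commands = []
    for role, policies in ROLE_POLICIES.items():
        for policy in policies:
            commands.append([
                "aws", "iam", "attach-role-policy",
                "--policy-arn", f"arn:aws:iam::{account_id}:policy/{policy}",
                "--role-name", role,
            ])
    return commands


if __name__ == "__main__":
    # Only prints the commands; nothing is executed against AWS.
    for command in attach_commands("123456789012"):
        print(shlex.join(command))
```

The sketch only prints the commands so they can be reviewed first; each argument list could also be passed to `subprocess.run` once you're happy with it.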

7. Assign the ray-ec2-launcher policy to a launchpad role/user

This step is optional. It limits the permissions assigned to the role/user that operates the Ray cluster launcher, which is useful if, for example, you're an AWS administrator and need to allow one of your users to (only) launch Ray clusters.

8. Edit your cluster config YAML file

Under head_node:, add:

IamInstanceProfile:
  Arn: arn:aws:iam::<aws-account-number>:instance-profile/ray-head-v1

Under worker_nodes:, add:

IamInstanceProfile:
  Arn: arn:aws:iam::<aws-account-number>:instance-profile/ray-worker-v1
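Two kinds of ARN show up in this guide and they're easy to mix up: the cluster config takes an instance-profile ARN, while IAM policies that refer to the role itself use a role ARN. A small sketch that keeps them apart and renders the config fragment shown above; all three function names are hypothetical and the account number is a placeholder:

```python
def instance_profile_arn(account_id: str, name: str) -> str:
    """ARN of an instance profile, as used under IamInstanceProfile in the YAML."""
    return f"arn:aws:iam::{account_id}:instance-profile/{name}"


def role_arn(account_id: str, name: str) -> str:
    """ARN of the IAM role itself, as used when a policy refers to the role."""
    return f"arn:aws:iam::{account_id}:role/{name}"


def iam_profile_fragment(account_id: str, name: str) -> str:
    """Render the two-line YAML fragment to paste into the node config."""
    return "IamInstanceProfile:\n  Arn: " + instance_profile_arn(account_id, name)


if __name__ == "__main__":
    account = "123456789012"  # placeholder account number
    print("# head node")
    print(iam_profile_fragment(account, "ray-head-v1"))
    print("# worker nodes")
    print(iam_profile_fragment(account, "ray-worker-v1"))
```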

Summary

While the ray-ec2-launcher policy has reduced permissions compared to the original, it's still possible to whittle this down further by specifying the exact AMIs, subnets, key pairs, etc. that the cluster launcher is allowed to access, instead of using wildcards.
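To illustrate that last point, here is a rough sketch of a tighter `ec2:RunInstances` statement that names specific resources. `restricted_run_instances` is a hypothetical helper and every ID in the usage example is made up. The sketch assumes that instances, network interfaces and volumes still need wildcards because they are created at launch time:

```python
def restricted_run_instances(account_id: str, region: str, *,
                             ami_ids: list[str],
                             subnet_ids: list[str],
                             key_names: list[str],
                             security_group_ids: list[str]) -> dict:
    """Build an ec2:RunInstances statement limited to named resources."""
    ec2 = f"arn:aws:ec2:{region}:{account_id}"
    # AMI ARNs carry no account number.
    resources = [f"arn:aws:ec2:{region}::image/{ami}" for ami in ami_ids]
    resources += [f"{ec2}:subnet/{subnet}" for subnet in subnet_ids]
    resources += [f"{ec2}:key-pair/{key}" for key in key_names]
    resources += [f"{ec2}:security-group/{sg}" for sg in security_group_ids]
    # Assumption: these are created by the launch itself, so keep wildcards.
    resources += [f"{ec2}:{kind}/*"
                  for kind in ("instance", "network-interface", "volume")]
    return {"Effect": "Allow", "Action": "ec2:RunInstances",
            "Resource": resources}


if __name__ == "__main__":
    import json

    # All IDs below are placeholders for illustration only.
    statement = restricted_run_instances(
        "123456789012", "us-west-1",
        ami_ids=["ami-0123456789abcdef0"],
        subnet_ids=["subnet-0aaa1111"],
        key_names=["ray-key"],
        security_group_ids=["sg-0bbb2222"],
    )
    print(json.dumps(statement, indent=4))
```

This single statement would stand in for the two `ec2:RunInstances` statements in the ray-ec2-launcher policy; the remaining statements stay as they are.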

richardliaw commented 4 years ago

Maybe we can post this to the docs somewhere?

VishDev12 commented 4 years ago

Yeah, that sounds like a good idea; do you mean something like linking this issue from there?

stale[bot] commented 3 years ago

Hi, I'm a bot from the Ray team :)

To help human contributors to focus on more relevant issues, I will automatically add the stale label to issues that have had no activity for more than 4 months.

If there is no further activity in the next 14 days, the issue will be closed!

You can always ask for help on our discussion forum or Ray's public slack channel.

richardliaw commented 3 years ago

Yeah, maybe this should be added to this page: https://docs.ray.io/en/master/cluster/aws-tips.html#aws-cluster

WillCodeCo commented 3 years ago

This almost worked for me, but I needed to change the ARNs for iam:PassRole to be:

{
...
        {
            "Effect": "Allow",
            "Action": [
                "iam:PassRole"
            ],
            "Resource": [
                "arn:aws:iam::<aws-account-number>:role/ray-head-v1",
                "arn:aws:iam::<aws-account-number>:role/ray-worker-v1"
            ]
        }
}

mlubej commented 2 years ago

If you're following this guide to launch a cluster of Spot instances and, on the slim chance, don't have the service-linked role for creating Spot instances (e.g. because you don't have access to a root user), you should run:

aws iam create-service-linked-role --aws-service-name spot.amazonaws.com

zahababu commented 1 year ago

Maybe we can post this to the docs somewhere?

Is there a good beginner's guide to Ray?

Michalos88 commented 9 months ago

This should be put somewhere in docs. I find the default permissions too open for a production setting.

Mystorius commented 9 months ago

Hey, I am facing a problem with the above-mentioned guide.

If I provide the following configuration for my worker nodes, then once the ray up config.yaml --yes command has finished and I attach to the cluster, I get the error ray.worker.default: UnauthorizedOperation. However, the role I used to authenticate with AWS has AdministratorAccess.

node_config:
  InstanceType: t3.micro
  IamInstanceProfile:
    Arn: arn:aws:iam::OURAWSID:instance-profile/ray-worker-v1

However, for the head node, everything works fine:

IamInstanceProfile:
  Arn: arn:aws:iam::OURAWSID:instance-profile/ray-head-v1

If I do not provide any Arn configuration for the worker nodes, the cluster also starts without any problems. However, none of the created worker nodes then has an IAM role attached: if I look at a worker node's EC2 instance, "IAM Role" shows no attached role, so my worker nodes cannot access my S3 storage, for example.

Does anyone have an idea why the config is not working?

PS: The only workaround is to manually set the IAM role on the worker using the EC2 portal, i.e. Actions -> Security -> Modify IAM role -> select ray-worker-v1, but that is not really a viable solution.

ronaldo-valente-sgpiu commented 4 months ago

Is there a similar solution for Fargate PODs, especially when the RayCluster is being automatically created by a RayJob yaml?