ray-project / ray

Ray is a unified framework for scaling AI and Python applications. Ray consists of a core distributed runtime and a set of AI Libraries for accelerating ML workloads.
https://ray.io
Apache License 2.0

AccessDenied error when running ray submit example-full.yaml script.py #17322

Open outdoteth opened 3 years ago

outdoteth commented 3 years ago

When I run the following command:

 ray submit example-full.yaml train.py

I get this error:

2021-07-25 11:43:16,053 INFO commands.py:298 -- Checking AWS environment settings
2021-07-25 11:43:16,057 WARN commands.py:306 -- Failed to autodetect node resources: 'EC2' object has no attribute 'describe_instance_types'. You can see full stack trace with higher verbosity.
2021-07-25 11:43:16,061 WARN util.py:124 -- The `head_node` field is deprecated and will be ignored. Use `head_node_type` and `available_node_types` instead.
2021-07-25 11:43:16,160 PANIC utils.py:102 -- Failed to fetch IAM instance profile data for ray-autoscaler-v1 from AWS.
Error code: AccessDenied
2021-07-25 11:43:16,161 ERR utils.py:104 -- !!! Boto3 error:
2021-07-25 11:43:16,161 PANIC utils.py:106 -- An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:sts::157977474198:assumed-role/ray-autoscaler-v1/i-079a3ece71e0e6e9c is not authorized to perform: iam:GetInstanceProfile on resource: instance profile ray-autoscaler-v1
2021-07-25 11:43:16,161 ERR utils.py:106 -- !!!
2021-07-25 11:43:16,161 ERROR syncer.py:190 -- Sync execution failed.
Traceback (most recent call last):
  File "/home/ray/anaconda3/lib/python3.7/site-packages/ray/autoscaler/_private/aws/config.py", line 734, in _get_instance_profile
    profile.load()
  File "/home/ray/anaconda3/lib/python3.7/site-packages/boto3/resources/factory.py", line 505, in do_action
    response = action(self, *args, **kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/boto3/resources/action.py", line 83, in __call__
    response = getattr(parent.meta.client, operation_name)(**params)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/botocore/client.py", line 324, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/ray/anaconda3/lib/python3.7/site-packages/botocore/client.py", line 622, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:sts::157977474198:assumed-role/ray-autoscaler-v1/i-079a3ece71e0e6e9c is not authorized to perform: iam:GetInstanceProfile on resource: instance profile ray-autoscaler-v1

I have put my credentials in the ~/.aws/credentials file and they are valid, so I'm not sure why I'm getting this error. My credentials file looks like this (but with real credentials):

[default]
aws_access_key_id = 123
aws_secret_access_key = 123

outdoteth commented 3 years ago

Fixed by adding the IAMReadOnlyAccess policy to the ray-autoscaler-v1 role.
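
For reference, a minimal sketch of this fix in Python, assuming boto3 is installed and the caller's credentials are allowed to call iam:AttachRolePolicy. The helper name attach_iam_read_only is hypothetical; the role name and the policy name come from this thread, and the ARN is the AWS-managed form of IAMReadOnlyAccess.

```python
# ARN of the AWS-managed IAMReadOnlyAccess policy.
IAM_READ_ONLY_ARN = "arn:aws:iam::aws:policy/IAMReadOnlyAccess"


def attach_iam_read_only(iam_client, role_name="ray-autoscaler-v1"):
    """Attach IAMReadOnlyAccess to `role_name` unless it is already attached.

    `iam_client` is expected to behave like boto3.client("iam").
    Returns True if the policy was attached, False if it was already present.
    Note: only the first page of attached policies is inspected.
    """
    response = iam_client.list_attached_role_policies(RoleName=role_name)
    attached = {p["PolicyArn"] for p in response.get("AttachedPolicies", [])}
    if IAM_READ_ONLY_ARN in attached:
        return False
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=IAM_READ_ONLY_ARN)
    return True


# Usage (run from a machine whose credentials may modify IAM roles):
#   import boto3
#   attach_iam_read_only(boto3.client("iam"))
```

IAMReadOnlyAccess grants more than the single iam:GetInstanceProfile action named in the error, so a narrower custom policy may be preferable where least privilege matters.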

vladfi1 commented 2 years ago

I am still seeing this issue in ray 1.8.0 when I run ray exec cluster.yaml --start --stop 'echo "hello world"'.

Failed to fetch IAM instance profile data for ray-autoscaler-v1 from AWS.
Error code: AccessDenied

!!! Boto3 error:
An error occurred (AccessDenied) when calling the GetInstanceProfile operation: User: arn:aws:sts::<number>:assumed-role/ray-autoscaler-v1/i-<number> is not authorized to perform: iam:GetInstanceProfile on resource: instance profile ray-autoscaler-v1
!!!

The error seems to occur when ray tries to tear down the cluster. Indeed, if I remove the --stop then there's no error. Shutting down manually with ray down cluster.yaml works fine.
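
A rough sketch of that workaround as a script, assuming the ray CLI is on PATH: it runs the command with --start but without --stop, then tears the cluster down separately with ray down. The function name start_run_then_down and the injectable run parameter are illustrative only.

```python
import shlex
import subprocess


def start_run_then_down(cluster_yaml, command, run=subprocess.run):
    """Run `command` via `ray exec --start`, then tear down with `ray down -y`.

    `run` defaults to subprocess.run and can be swapped out for a dry run.
    Returns the two argument lists that were passed to `run`.
    """
    exec_argv = ["ray", "exec", cluster_yaml, "--start", command]
    down_argv = ["ray", "down", "-y", cluster_yaml]
    try:
        print("+", shlex.join(exec_argv))
        run(exec_argv, check=True)
    finally:
        # Tear down even if the command failed, so the cluster is not left running.
        print("+", shlex.join(down_argv))
        run(down_argv, check=True)
    return [exec_argv, down_argv]


# Usage:
#   start_run_then_down("cluster.yaml", 'echo "hello world"')
```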

DmitriGekhtman commented 2 years ago

It seems possible that the permissions for the default autoscaler role are not correctly configured. Reopening this as a reminder for us to look into it.

vladfi1 commented 2 years ago

Curious if any progress has been made on this. If it helps, ray's own example-full.yaml works to reproduce.

vladfi1 commented 12 months ago

Still running into this in ray 2.7.

anyscalesam commented 3 months ago

This is recent enough to warrant a look; @jjyao, let's cover it in next week's regular core GH triage.

jjyao commented 3 months ago

Hi @vladfi1, I tried ray up python/ray/autoscaler/aws/example-full.yaml with the latest Ray and it worked for me. Could you run that command with the latest Ray and see what errors you get?

vladfi1 commented 3 months ago

@jjyao The issue for me is with --stop, for example:

ray exec example-full.yaml --start --stop 'echo "hello world"'

jjyao commented 3 months ago

@vladfi1 I see. We will look into this, but ray exec is currently low priority for us, so a fix might take a while. In the meantime, we recommend using Ray job submission or SSHing to the node directly instead of ray exec.
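
As an illustration of the job submission route, here is a minimal sketch using the Ray Jobs Python SDK. It assumes a cluster is already running and that its dashboard is reachable at http://127.0.0.1:8265 (for example through port forwarding); the helper run_job and its polling parameters are hypothetical, not part of Ray.

```python
import time

# Job states after which polling can stop.
TERMINAL_STATES = {"SUCCEEDED", "FAILED", "STOPPED"}


def run_job(client, entrypoint, poll_seconds=1.0, max_polls=300, sleep=time.sleep):
    """Submit `entrypoint`, poll until it finishes, and return (job_id, status, logs).

    `client` is expected to behave like ray.job_submission.JobSubmissionClient.
    If `max_polls` is exhausted, the last observed status is returned.
    """
    job_id = client.submit_job(entrypoint=entrypoint)
    status = None
    for _ in range(max_polls):
        raw = client.get_job_status(job_id)
        # Accept either an enum member or a plain string.
        status = str(getattr(raw, "value", raw))
        if status in TERMINAL_STATES:
            break
        sleep(poll_seconds)
    return job_id, status, client.get_job_logs(job_id)


# Usage against a running cluster (requires ray[default] to be installed):
#   from ray.job_submission import JobSubmissionClient
#   client = JobSubmissionClient("http://127.0.0.1:8265")  # assumed address
#   print(run_job(client, 'echo "hello world"'))
```

Unlike ray exec with --start and --stop, this does not start or stop the cluster, so ray up and ray down still have to be run separately.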