skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

sky check run failing on availability zones permission, but works via aws cli #2451

Closed. maxyousif15 closed this issue 1 year ago

maxyousif15 commented 1 year ago

I'm experiencing an error whereby running sky check fails, but running aws ec2 describe-availability-zones works.

AWS version:

(base) max@Maxs-MacBook-Pro ~ % aws --version
aws-cli/1.20.29 Python/3.8.5 Darwin/22.5.0 botocore/1.21.29

Skypilot version:

(base) max@Maxs-MacBook-Pro ~ % sky --version
skypilot, version 0.3.3

AWS config information (~/.aws/config):

(base) max@Maxs-MacBook-Pro ~ % cat ~/.aws/config
[default]
region = eu-west-2
output = json

[profile live]
source_profile = default
region = eu-west-2
role_arn = arn:aws:iam::XXX:role/name
mfa_serial = arn:aws:iam::XXX:mfa/name

[profile rnd]
source_profile = default
region = eu-west-2
role_arn = arn:aws:iam::XXX:role/name
mfa_serial = arn:aws:iam::XXX:mfa/name

AWS credentials information (~/.aws/credentials):

(base) max@Maxs-MacBook-Pro ~ % cat ~/.aws/credentials 
[default]
aws_access_key_id = XXX
aws_secret_access_key = XXX

You can see the terminal output below:

(base) max@Maxs-MacBook-Pro ~ % sky check -v          
Checking credentials to enable clouds for SkyPilot.
  Checking AWS...I 08-23 10:30:41 aws_catalog.py:78] Fetching availability zones mapping for AWS...
RuntimeError: Failed to retrieve availability zone. Please ensure that the `ec2:DescribeAvailabilityZones` action is enabled for your AWS account in IAM. Ref: https://docs.aws.amazon.com/AWSEC2/latest/APIReference/API_DescribeAvailabilityZones.html
(base) max@Maxs-MacBook-Pro ~ % aws ec2 describe-availability-zones 
{
    "AvailabilityZones": [
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "eu-west-2",
            "ZoneName": "eu-west-2a",
            "ZoneId": "euw2-az2",
            "GroupName": "eu-west-2",
            "NetworkBorderGroup": "eu-west-2",
            "ZoneType": "availability-zone"
        },
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "eu-west-2",
            "ZoneName": "eu-west-2b",
            "ZoneId": "euw2-az3",
            "GroupName": "eu-west-2",
            "NetworkBorderGroup": "eu-west-2",
            "ZoneType": "availability-zone"
        },
        {
            "State": "available",
            "OptInStatus": "opt-in-not-required",
            "Messages": [],
            "RegionName": "eu-west-2",
            "ZoneName": "eu-west-2c",
            "ZoneId": "euw2-az1",
            "GroupName": "eu-west-2",
            "NetworkBorderGroup": "eu-west-2",
            "ZoneType": "availability-zone"
        }
    ]
}

This could be an error on my side, but any insight would be much appreciated.

Also, is there a way to specify the profile to use for sky check? My initial thoughts are that maybe it is attempting to use an unauthorised profile?
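On the profile question: boto3 and the AWS CLI both read the AWS_PROFILE environment variable, so it may help to check which profile name resolves when it is unset. Below is a minimal sketch, assuming the ~/.aws/config layout shown above. list_profiles and active_profile are hypothetical helper names, and whether sky check honours AWS_PROFILE is not confirmed in this thread.

```python
import configparser


def list_profiles(config_text):
    """Return the profile names defined in the text of an ~/.aws/config file."""
    parser = configparser.ConfigParser()
    parser.read_string(config_text)
    names = []
    for section in parser.sections():
        # Non-default profiles are written as "[profile NAME]" in ~/.aws/config.
        if section.startswith("profile "):
            names.append(section[len("profile "):])
        else:
            names.append(section)
    return names


def active_profile(environ):
    """Profile name selected by AWS_PROFILE, falling back to 'default'."""
    return environ.get("AWS_PROFILE", "default")
```

Calling active_profile(os.environ) in the same shell that runs sky check would show whether a role profile such as live or rnd is selected, or just the default credentials.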

maxyousif15 commented 1 year ago

After digging into the code, I've noticed that eu-central-2 is not supported. Would it be wise to add this to some documentation somewhere (if it doesn't already exist)?

TRT-BradleyB commented 1 year ago

I'm also having this issue. aws ec2 describe-availability-zones works, so I'm assuming SkyPilot is grabbing the wrong account for whatever reason.

Michaelvll commented 1 year ago

Thanks for raising the issue @maxyousif15 @TRT-BradleyB!

We merged #2456 to print more account information for the error during sky check. Could you help run sky check again to see if SkyPilot is using the same identity as the aws ec2 cli?

TRT-BradleyB commented 1 year ago

I can confirm the same account is shown in sky check, as with aws sts get-caller-identity. aws ec2 describe-availability-zones works with these details.

TRT-BradleyB commented 1 year ago

Looks like a mismatch between where enabled regions are identified here: https://github.com/skypilot-org/skypilot/blob/469a62d9ac20043f6d267edcf04f5928dae65034/sky/clouds/service_catalog/data_fetchers/fetch_aws.py#L79

And where AZs are obtained here:

https://github.com/skypilot-org/skypilot/blob/469a62d9ac20043f6d267edcf04f5928dae65034/sky/clouds/service_catalog/data_fetchers/fetch_aws.py#L419

It seems my org has set me up such that all regions show as enabled, but I only have describe-availability-zones permissions (and presumably other permissions) on eu-west-2.

I thought I could get around this by setting my region to eu-west-2 during sky launch, but it seems like all AZs are still checked?

I imagine this is due to my org's bad AWS config, but I would appreciate finer control to work around it! Would it be possible to additionally restrict the regions with something like regions_enabled.intersection(set(ALL_REGIONS)).intersection(specified_regions)?

I've gotten around it by changing all regions to:

ALL_REGIONS = [ 'eu-west-2' ]

And can confirm it now works - but obv this is just a temp solution
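The intersection idea above can be sketched as follows. This is a minimal illustration and not the code in fetch_aws.py: candidate_regions is a hypothetical helper, and the ALL_REGIONS value here is an illustrative subset rather than SkyPilot's actual list.

```python
# Illustrative subset only; the real ALL_REGIONS list in SkyPilot is longer.
ALL_REGIONS = ["us-east-1", "us-west-2", "eu-west-2", "eu-central-1"]


def candidate_regions(enabled_regions, specified_regions=None):
    """Regions that are enabled, supported, and (optionally) user-specified."""
    regions = set(enabled_regions).intersection(ALL_REGIONS)
    if specified_regions:
        # Narrow further only when the user asked for particular regions.
        regions = regions.intersection(specified_regions)
    return sorted(regions)
```

With specified_regions=["eu-west-2"], only eu-west-2 would be queried for availability zones, which has the same effect as the manual ALL_REGIONS edit without changing the source.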

Michaelvll commented 1 year ago

Thanks for testing @TRT-BradleyB! This is very useful!

We previously assumed people would have permissions across all the regions they enabled, but it seems permissions can be set per region.

It should definitely be fine to skip those regions when _get_availability_zones fails with a permission-denied error. We just submitted PR #2463 to fix this. Would you like to try it out to see if it fixes your problem?
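The skip-on-permission-error behaviour described here could look roughly like the sketch below. It is not the code from #2463: PermissionDeniedError is a stand-in exception, and collect_availability_zones and its fetch_zones callback are hypothetical names used so the example runs without AWS credentials.

```python
class PermissionDeniedError(Exception):
    """Stand-in for the cloud SDK's permission-denied error (hypothetical)."""


def collect_availability_zones(regions, fetch_zones):
    """Fetch zones per region, skipping regions where permission is denied.

    fetch_zones is any callable that takes a region name and returns a list
    of zone names, raising PermissionDeniedError when access is refused.
    """
    zones_by_region = {}
    skipped = []
    for region in regions:
        try:
            zones_by_region[region] = fetch_zones(region)
        except PermissionDeniedError:
            # Record the region and move on instead of failing the whole check.
            skipped.append(region)
    return zones_by_region, skipped
```

Returning the skipped regions alongside the results keeps the failure visible, so a caller can still report which regions were left out.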

TRT-BradleyB commented 1 year ago

This fixes the issue, thanks a lot!

maxyousif15 commented 1 year ago

Sorry, I was away on holiday whilst this was being addressed. Yes, it seemed to me that not all regions have the correct permissions and I had the same solution as @TRT-BradleyB whereby I limited the ALL_REGIONS variable to those enabled. Skipping disabled regions seems to be sensible to me since previously the code would fall over for those not enabled. The #2463 PR seems to fix my issues, thanks!

maxyousif15 commented 1 year ago

What is the release process like? I'm currently installing the version with the fix from my local clone. I'm assuming this will eventually end up on PyPI?

concretevitamin commented 1 year ago

@maxyousif15 We aim to release a new minor version roughly every 4 months, depending on the updates.

If you prefer, installing nightly releases is a good way to get new updates from PyPI:

pip install -U "skypilot-nightly[aws,gcp,azure,ibm,oci,scp,lambda]"  # choose your clouds

maxyousif15 commented 1 year ago

That's great, thank you. I will use the nightly approach, as it's easier to automate via pip than using git clone. Thanks again!