skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[AWS] Use AWS_PROFILE if set locally #2737

Open stefannica opened 10 months ago

stefannica commented 10 months ago

If the AWS_PROFILE environment variable is set on the client, it is not carried over to the AWS EC2 VM. As a result, subsequent tasks may fail due to missing permissions, because the VM falls back to the default AWS configuration profile.

How to reproduce:

  1. install skypilot 0.4.0 with AWS support
  2. set up an AWS configuration profile named e.g. skypilot with the credentials required by SkyPilot. To reproduce this problem properly, the default AWS configuration profile should not have these permissions
  3. run the quickstart the usual way, but pass the AWS credentials via the AWS_PROFILE environment variable:
AWS_PROFILE=skypilot sky launch -c mycluster hello_sky.yaml

This command should work without problems:

$ AWS_PROFILE=skypilot sky launch -c mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
AWS: Fetching availability zones mapping...I 10-26 21:05:50 optimizer.py:682] == Optimizer ==
I 10-26 21:05:50 optimizer.py:693] Target: minimizing cost
I 10-26 21:05:50 optimizer.py:705] Estimated cost: $0.4 / hour
I 10-26 21:05:50 optimizer.py:705] 
I 10-26 21:05:50 optimizer.py:778] Considered resources (1 node):
I 10-26 21:05:50 optimizer.py:827] ------------------------------------------------------------------------------------------
I 10-26 21:05:50 optimizer.py:827]  CLOUD   INSTANCE      vCPUs   Mem(GB)   ACCELERATORS   REGION/ZONE   COST ($)   CHOSEN   
I 10-26 21:05:50 optimizer.py:827] ------------------------------------------------------------------------------------------
I 10-26 21:05:50 optimizer.py:827]  AWS     m6i.2xlarge   8       32        -              us-east-1     0.38          ✔     
I 10-26 21:05:50 optimizer.py:827] ------------------------------------------------------------------------------------------
I 10-26 21:05:50 optimizer.py:827] 
Launching a new cluster 'mycluster'. Proceed? [Y/n]: Y
I 10-26 21:06:13 cloud_vm_ray_backend.py:4237] Creating a new cluster: "mycluster" [1x AWS(m6i.2xlarge)].
I 10-26 21:06:13 cloud_vm_ray_backend.py:4237] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 10-26 21:06:13 cloud_vm_ray_backend.py:1427] To view detailed progress: tail -n100 -f /home/stefan/sky_logs/sky-2023-10-26-21-05-44-675191/provision.log
I 10-26 21:06:14 cloud_vm_ray_backend.py:1816] Launching on AWS us-east-1 (us-east-1a,us-east-1b,us-east-1c,us-east-1d,us-east-1f)
I 10-26 21:08:12 log_utils.py:45] Head node is up.
I 10-26 21:10:08 cloud_vm_ray_backend.py:1623] Successfully provisioned or found existing VM.
I 10-26 21:10:13 cloud_vm_ray_backend.py:3006] Syncing workdir (to 1 node): . -> ~/sky_workdir
I 10-26 21:10:13 cloud_vm_ray_backend.py:3014] To view detailed progress: tail -n100 -f ~/sky_logs/sky-2023-10-26-21-05-44-675191/workdir_sync.log
I 10-26 21:10:14 cloud_vm_ray_backend.py:3110] Running setup on 1 node.
Warning: Permanently added '52.90.158.106' (ECDSA) to the list of known hosts.
Running setup.
I 10-26 21:10:22 cloud_vm_ray_backend.py:3120] Setup completed.
I 10-26 21:10:29 cloud_vm_ray_backend.py:3217] Job submitted with Job ID: 1
I 10-26 19:10:31 log_lib.py:425] Start streaming logs for job 1.
INFO: Tip: use Ctrl-C to exit log streaming (task will not be killed).
INFO: Waiting for task resources on 1 node. This will block if the cluster is full.
INFO: All task resources reserved.
INFO: Reserved IPs: ['172.31.30.105']
(task, pid=25346) Hello, SkyPilot!
(task, pid=25346) # conda environments:
(task, pid=25346) #
(task, pid=25346) base                  *  /opt/conda
(task, pid=25346) pytorch                  /opt/conda/envs/pytorch
(task, pid=25346) 
INFO: Job finished (status: SUCCEEDED).
Shared connection to 52.90.158.106 closed.
I 10-26 21:10:35 cloud_vm_ray_backend.py:3250] Job ID: 1
I 10-26 21:10:35 cloud_vm_ray_backend.py:3250] To cancel the job:   sky cancel mycluster 1
I 10-26 21:10:35 cloud_vm_ray_backend.py:3250] To stream job logs:  sky logs mycluster 1
I 10-26 21:10:35 cloud_vm_ray_backend.py:3250] To view the job queue:   sky queue mycluster
I 10-26 21:10:35 cloud_vm_ray_backend.py:3373] 
I 10-26 21:10:35 cloud_vm_ray_backend.py:3373] Cluster name: mycluster
I 10-26 21:10:35 cloud_vm_ray_backend.py:3373] To log into the head VM: ssh mycluster
I 10-26 21:10:35 cloud_vm_ray_backend.py:3373] To submit a job:     sky exec mycluster yaml_file
I 10-26 21:10:35 cloud_vm_ray_backend.py:3373] To stop the cluster: sky stop mycluster
I 10-26 21:10:35 cloud_vm_ray_backend.py:3373] To teardown the cluster: sky down mycluster
Clusters
AWS: Fetching availability zones mapping...NAME       LAUNCHED     RESOURCES            STATUS  AUTOSTOP  COMMAND                       
mycluster  23 secs ago  1x AWS(m6i.2xlarge)  UP      -         sky launch -c mycluster h...  
  4. run the quickstart again. This time, the task blocks indefinitely, waiting for the cluster to become ready:
$ AWS_PROFILE=skypilot sky launch -c mycluster hello_sky.yaml
Task from YAML spec: hello_sky.yaml
AWS: Fetching availability zones mapping...Running task on cluster mycluster...
I 10-26 21:11:08 cloud_vm_ray_backend.py:1427] To view detailed progress: tail -n100 -f /home/stefan/sky_logs/sky-2023-10-26-21-11-01-968711/provision.log
I 10-26 21:11:08 cloud_vm_ray_backend.py:1816] Launching on AWS us-east-1 (us-east-1a)
I 10-26 21:11:36 log_utils.py:45] Head node is up.
I 10-26 21:11:53 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:11:56 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:00 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:04 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:07 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:11 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:14 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:18 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.
I 10-26 21:12:22 cloud_vm_ray_backend.py:2025] Waiting for ray cluster to be ready remotely.

Looking at the Ray cluster logs inside the VM, you should see AWS permission errors, because the VM is using the default AWS configuration profile instead of the one passed on the client, e.g.:

An error occurred (UnauthorizedOperation) when calling the DescribeInstances operation: You are not authorized to perform this operation. User: arn:aws:iam::xxxx:user/stefan is not authorized to perform: ec2:DescribeInstances because no identity-based policy allows the ec2:DescribeInstances action
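
To make the failure mode concrete, here is a minimal boto3 sketch (not SkyPilot code; the skypilot profile name is just the one from the reproduction steps). The same call resolves different profiles on the two sides because the credentials file may be copied to the VM, but the AWS_PROFILE variable is not:

```python
import boto3

# On the client, AWS_PROFILE=skypilot makes boto3 (and therefore the tools
# built on it) resolve the 'skypilot' profile from ~/.aws/config and
# ~/.aws/credentials.
#
# On the launched VM the credentials file may be present, but AWS_PROFILE is
# not set, so the very same call silently falls back to the 'default'
# profile, which may lack permissions such as ec2:DescribeInstances.
session = boto3.Session()  # no explicit profile_name
print(session.profile_name)  # 'skypilot' on the client, 'default' on the VM
print(session.client("sts").get_caller_identity()["Arn"])
```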
concretevitamin commented 10 months ago

Thanks for the report @stefannica. One detail to confirm: what's the output of `AWS_PROFILE=skypilot aws configure list`?

Relevant code:

https://github.com/skypilot-org/skypilot/blob/e5e400ba9b2b140866cbb8c9e3e4bb88df18dd33/sky/clouds/aws.py#L524-L560

stefannica commented 10 months ago

I used skypilot as an example, but the exact contents of the AWS profile can be an IAM role or even an STS token (see the `aws configure list` outputs quoted below).

stefannica commented 10 months ago

By the way, we solved a similar problem in ZenML: picking up AWS credentials from the client machine and storing them somewhere else. The relevant ZenML code doesn't explicitly look at the AWS configuration files and doesn't even use the AWS CLI Python package, just plain boto3 and botocore, which makes it easier to detach from the local AWS setup and probably more portable. I'll reference it here in case it helps: https://github.com/zenml-io/zenml/blob/main/src/zenml/integrations/aws/service_connectors/aws_service_connector.py#L1389-L1643
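
For reference, the gist of that approach as a rough, hypothetical sketch (the function name and structure are mine, not the actual ZenML code): resolve whatever the local profile yields through boto3/botocore and hand off only the materialized access key, secret, and optional session token, so the consumer no longer depends on the local AWS config files or on AWS_PROFILE:

```python
import boto3


def materialize_profile_credentials(profile_name: str) -> dict:
    """Resolve a local AWS profile (static keys, assume-role, STS token, ...)
    into plain credential values that can be shipped elsewhere."""
    session = boto3.Session(profile_name=profile_name)
    frozen = session.get_credentials().get_frozen_credentials()
    return {
        "aws_access_key_id": frozen.access_key,
        "aws_secret_access_key": frozen.secret_key,
        # Only set for temporary credentials (STS / assume-role).
        "aws_session_token": frozen.token,
        "region_name": session.region_name,
    }


# The resulting values work with boto3 directly, independent of any profile:
creds = materialize_profile_credentials("skypilot")
ec2 = boto3.client("ec2", **{k: v for k, v in creds.items() if v})
```

One caveat with this approach is that assume-role/STS credentials obtained this way are temporary, so they have to be refreshed before they expire.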

concretevitamin commented 10 months ago

I used skypilot as an example, but the exact contents of the AWS profile can be an IAM role or even an STS token:

  • STS token credentials:
$ AWS_PROFILE=zenml-7f225511 aws configure list
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile           zenml-7f225511              env    ['AWS_DEFAULT_PROFILE', 'AWS_PROFILE']
access_key     ****************7U7F shared-credentials-file    
secret_key     ****************FQ7G shared-credentials-file    
    region             eu-central-1      config-file    ~/.aws/config
  • IAM role credentials:
      Name                    Value             Type    Location
      ----                    -----             ----    --------
   profile               connectors              env    ['AWS_DEFAULT_PROFILE', 'AWS_PROFILE']
access_key     ****************BG6D      assume-role    
secret_key     ****************GxVD      assume-role    
    region                us-east-1      config-file    ~/.aws/config

To confirm: Which of these 2 cases wasn't working in the original issue description?

It seems like the first case (STS -> shared-credentials-file) should be handled by our current code? If not, something needs to be changed. (E.g., maybe the temporary credentials were auto-stored at a different path than ~/.aws/credentials.)
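
A quick way to check which of the two cases applies for a given profile (a small botocore sketch for debugging only, not part of SkyPilot):

```python
import botocore.session

# botocore records how the credentials for a profile were resolved, e.g.
# 'shared-credentials-file', 'assume-role', 'env', 'sso', ...
session = botocore.session.Session(profile="skypilot")
creds = session.get_credentials()
print(creds.method if creds else "no credentials resolved")
```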

For the second case, assume-role, it seems like we simply need to add an enum like this and mimic its handling:

https://github.com/skypilot-org/skypilot/blob/e5e400ba9b2b140866cbb8c9e3e4bb88df18dd33/sky/clouds/aws.py#L80
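
Roughly what that could look like; the enum and member names below are assumptions for illustration only, not the actual contents of the linked file:

```python
import enum


class AWSIdentityType(enum.Enum):  # hypothetical name, standing in for the linked enum
    SHARED_CREDENTIALS_FILE = 'shared-credentials-file'
    # New member for profiles whose credentials botocore reports as
    # 'assume-role'; its detection and handling would mimic the member above.
    ASSUME_ROLE = 'assume-role'
```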

In either case, would love your help in patching this!

stefannica commented 10 months ago

To confirm: Which of these 2 cases wasn't working in the original issue description?

@concretevitamin neither of them was working, because both rely on the AWS_PROFILE environment variable which, as I understand it, is currently not supported.

It seems like the first case (STS -> shared-credentials-file) should be handled by our current code? If not, something needs to be changed. (E.g., maybe the temporary credentials were auto-stored at a different path than ~/.aws/credentials.)

The credentials were indeed stored in ~/.aws/credentials, but not under the default AWS profile.

For the second case, assume-role, it seems like we simply need to add an enum like this and mimic its handling:

https://github.com/skypilot-org/skypilot/blob/e5e400ba9b2b140866cbb8c9e3e4bb88df18dd33/sky/clouds/aws.py#L80

In either case, would love your help in patching this!

stefannica commented 10 months ago

In either case, would love your help in patching this!

Sure, I'd love to help out with this. I'll try to set out some time for it over the next few days.

github-actions[bot] commented 5 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

samarthpusalkar commented 1 month ago

I had the same issue. For anyone facing it: since the credentials file is copied to the cluster, use `sky --env AWS_PROFILE=aws_profile_to_use ...`

This sets the profile to use on the cluster as an environment variable, and you can point it at the AWS profile that has the right permissions.
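
For illustration, with the quickstart cluster used earlier in this thread, that would look something like `sky launch --env AWS_PROFILE=skypilot -c mycluster hello_sky.yaml`, assuming the skypilot profile is present in the AWS credentials file copied to the cluster.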