skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

VPC/subnets error when launch VM in new GCP project #2866

Closed lhqing closed 9 months ago

lhqing commented 9 months ago

Hi SkyPilot team

I'm trying to use SkyPilot in a new GCP project that I own.

I get this reproducible error when starting a VM, which looks to me like a VPC/subnets issue. When I check the skypilot-vpc network, which seems to have been auto-created by sky, it contains only us-central1 and us-east4. I tried setting the default region to us-central1 in gcloud init, but that doesn't help.

$ sky --version
skypilot, version 0.4.1
$ sky launch -c demo -y ~/src/commons/bican/sky-demo.yaml
Task from YAML spec: /Users/hanqingliu/src/commons/bican/sky-demo.yaml
Running task on cluster demo...
I 12-14 09:35:23 cloud_vm_ray_backend.py:4361] The cluster 'demo' (status: INIT) was not found on the cloud: it may be autodowned, manually terminated, or its launch never succeeded. Provisioning a new cluster by using the same resources as its original launch.
I 12-14 09:35:23 cloud_vm_ray_backend.py:4380] Creating a new cluster: 'demo' [1x GCP(n2d-standard-8, disk_size=128)].
I 12-14 09:35:23 cloud_vm_ray_backend.py:4380] Tip: to reuse an existing cluster, specify --cluster (-c). Run `sky status` to see existing clusters.
I 12-14 09:35:23 cloud_vm_ray_backend.py:1449] To view detailed progress: tail -n100 -f /Users/hanqingliu/sky_logs/sky-2023-12-14-09-35-21-065573/provision.log
I 12-14 09:35:26 cloud_vm_ray_backend.py:1887] Launching on GCP us-west1 (us-west1-a)
I 12-14 09:35:33 cloud_vm_ray_backend.py:807] ====== stdout ======
2023-12-14 09:35:26,729 INFO commands.py:276 -- Cluster: demo-dba3
2023-12-14 09:35:27,030 INFO commands.py:353 -- Checking External environment settings
I 12-14 09:35:29 config.py:365] _configure_iam_role: Checking permissions for skypilot-v1@hms-dev-greenberg-4923.iam.gserviceaccount.com...

I 12-14 09:35:33 cloud_vm_ray_backend.py:810] ====== stderr ======
Dropping the empty legacy field head_node. head_nodeis not supported for ray>=2.0.0. It is recommended to removehead_node from the cluster config.
Dropping the empty legacy field worker_nodes. worker_nodesis not supported for ray>=2.0.0. It is recommended to removeworker_nodes from the cluster config.
Traceback (most recent call last):
  File "/var/folders/fv/88lmpgd95tb8kf5x2n31_d5m0000gn/T/skypilot_ray_up_rkbvgvnh.py", line 76, in <module>
    sdk.create_or_update_cluster('/Users/hanqingliu/Documents/sky/sky_dir/generated/demo.yml', **{'no_restart': True})
  File "/Users/hanqingliu/mambaforge/lib/python3.10/site-packages/ray/autoscaler/sdk/sdk.py", line 38, in create_or_update_cluster
    return commands.create_or_update_cluster(
  File "/Users/hanqingliu/mambaforge/lib/python3.10/site-packages/ray/autoscaler/_private/commands.py", line 279, in create_or_update_cluster
    config = _bootstrap_config(config, no_config_cache=no_config_cache)
  File "/Users/hanqingliu/mambaforge/lib/python3.10/site-packages/ray/autoscaler/_private/commands.py", line 381, in _bootstrap_config
    resolved_config = provider_cls.bootstrap_config(config)
  File "/Users/hanqingliu/mambaforge/lib/python3.10/site-packages/sky/skylet/providers/gcp/node_provider.py", line 369, in bootstrap_config
    return bootstrap_gcp(cluster_config)
  File "/Users/hanqingliu/mambaforge/lib/python3.10/site-packages/sky/skylet/providers/gcp/config.py", line 312, in bootstrap_gcp
    config = _configure_subnet(config, compute)
  File "/Users/hanqingliu/mambaforge/lib/python3.10/site-packages/sky/skylet/providers/gcp/config.py", line 830, in _configure_subnet
    default_subnet = subnets[0]
IndexError: list index out of range

Clusters
NAME       LAUNCHED        RESOURCES                                                                  STATUS   AUTOSTOP  COMMAND
demo       a few secs ago  1x GCP(n2d-standard-8, disk_size=128)                                      INIT     -         sky launch -c demo -y /Us...

* 1 cluster has auto{stop,down} scheduled. Refresh statuses with: sky status --refresh

RuntimeError: Errors occurred during provision; check logs above.
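
One plausible reading of the traceback: the launch targeted us-west1, the VPC reportedly contains only us-central1 and us-east4, so the list of candidate subnets came back empty and `subnets[0]` raised `IndexError`. The sketch below illustrates that failure mode and a more descriptive check. It is not SkyPilot's code; `VPC_SUBNETS`, `pick_subnet`, and the subnet names are hypothetical.

```python
from typing import Dict, List

# Hypothetical stand-in for the subnets of skypilot-vpc described in the report.
VPC_SUBNETS: List[Dict[str, str]] = [
    {"name": "subnet-a", "region": "us-central1"},
    {"name": "subnet-b", "region": "us-east4"},
]


def pick_subnet(subnets: List[Dict[str, str]], region: str) -> Dict[str, str]:
    """Return the first subnet in `region`, or fail with a readable message."""
    in_region = [s for s in subnets if s["region"] == region]
    if not in_region:
        # Indexing in_region[0] here would raise a bare IndexError instead.
        available = sorted({s["region"] for s in subnets})
        raise ValueError(
            f"no subnet in region {region!r}; available regions: {available}"
        )
    return in_region[0]


print(pick_subnet(VPC_SUBNETS, "us-central1")["name"])
try:
    pick_subnet(VPC_SUBNETS, "us-west1")
except ValueError as err:
    print(f"error: {err}")
```

Checking for the empty list before indexing turns an opaque `IndexError` into a message that names the missing region.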

My VPC info:

[Image: screenshot of the VPC configuration]

Michaelvll commented 9 months ago

Thanks for reporting this issue @lhqing! This issue should be fixed in the latest nightly version. Could you try to install the latest nightly version of SkyPilot and try it again?

pip uninstall skypilot
pip install -U skypilot-nightly
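
To confirm which of the two distributions ended up installed after running these commands, a small check with the standard library's `importlib.metadata` may help. This is a minimal sketch: `installed_version` is a hypothetical helper, and the distribution names are taken from the pip commands above.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names as they appear in the pip commands above.
for name in ("skypilot", "skypilot-nightly"):
    print(f"{name}: {installed_version(name)}")
```

After a successful switch, one would expect `skypilot` to report `None` and `skypilot-nightly` to report a dev version.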

lhqing commented 9 months ago

Thanks @Michaelvll

A minor bug report: I installed the nightly version and cleared all of .sky and sky_logs. The new sky didn't recognize my project. After I ran gcloud auth application-default login, I could launch my VM successfully. Closing the issue for now.

# hanqingliu @ Hanqings-iMac in ~ [9:49:26]
$ sky check
Checking credentials to enable clouds for SkyPilot.
  AWS: disabled
    Reason: Failed to access AWS services with credentials. Make sure that the access and secret keys are correct. Run the following commands:
      $ pip install boto3
      $ aws configure
      $ aws configure list  # Ensure that this shows identity is set.
    For more info: https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html
    Details: `aws sts get-caller-identity` failed with error: [botocore.exceptions.ClientError] An error occurred (InvalidClientTokenId) when calling the GetCallerIdentity operation: The security token included in the request is invalid..
  Azure: disabled
    Reason: ~/.azure/msal_token_cache.json does not exist. Run the following commands:
      $ az login
      $ az account set -s <subscription_id>
    For more info: https://docs.microsoft.com/en-us/cli/azure/get-started-with-azure-cli
  GCP: disabled
    Reason: The following permissions are not enabled for the current GCP identity (hanqingliu@g.harvard.edu [project_id=hms-dev-greenberg-4923]):
    {'compute.instances.start', 'compute.subnetworks.list', 'compute.instances.get', 'iam.serviceAccounts.get', 'compute.disks.list', 'compute.instances.create', 'compute.instances.delete', 'compute.instances.list', 'compute.networks.getEffectiveFirewalls', 'resourcemanager.projects.getIamPolicy', 'compute.firewalls.delete', 'compute.networks.get', 'compute.subnetworks.useExternalIp', 'compute.instances.setLabels', 'serviceusage.services.enable', 'serviceusage.services.use', 'serviceusage.services.list', 'compute.subnetworks.use', 'compute.globalOperations.get', 'compute.firewalls.get', 'resourcemanager.projects.get', 'compute.firewalls.create', 'compute.instances.setServiceAccount', 'compute.instances.stop', 'compute.zoneOperations.get', 'compute.networks.list', 'iam.roles.get', 'compute.disks.create', 'compute.projects.get', 'iam.serviceAccounts.actAs'}
    For more details, visit: https://skypilot.readthedocs.io/en/latest/cloud-setup/cloud-permissions/gcp.html
  IBM: disabled
    Reason: Missing credential file at /Users/hanqingliu/.ibm/credentials.yaml.
    Store your API key and Resource Group id in ~/.ibm/credentials.yaml in the following format:
      iam_api_key: <IAM_API_KEY>
      resource_group_id: <RESOURCE_GROUP_ID>
  Kubernetes: disabled
    Reason: Credentials not found - check if ~/.kube/config exists.
  Lambda: disabled
    Reason: Failed to access Lambda Cloud with credentials. To configure credentials, go to:
      https://cloud.lambdalabs.com/api-keys
    to generate API key and add the line
      api_key = [YOUR API KEY]
    to ~/.lambda_cloud/lambda_keys
  OCI: disabled
    Reason: `oci` is not installed. Install it with: pip install oci
    For more details, refer to: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#oracle-cloud-infrastructure-oci
  SCP: disabled
    Reason: Failed to access SCP with credentials. To configure credentials, see: https://cloud.samsungsds.com/openapiguide
    Generate API key and add the following line to ~/.scp/scp_credential:
      access_key = [YOUR API ACCESS KEY]
      secret_key = [YOUR API SECRET KEY]
      project_id = [YOUR PROJECT ID]
  Cloudflare (for R2 object store): disabled
    Reason: [r2] profile is not set in ~/.cloudflare/r2.credentials. Additionally, Account ID from R2 dashboard is not set. Run the following commands:
      $ pip install boto3
      $ AWS_SHARED_CREDENTIALS_FILE=~/.cloudflare/r2.credentials aws configure --profile r2
      $ mkdir -p ~/.cloudflare
      $ echo <YOUR_ACCOUNT_ID_HERE> > ~/.cloudflare/accountid
    For more info: https://skypilot.readthedocs.io/en/latest/getting-started/installation.html#cloudflare-r2
No cloud is enabled. SkyPilot will not be able to run any task. Run `sky check` for more info.
(base)
# hanqingliu @ Hanqings-iMac in ~ [9:49:41]
$ sky launch -c demo -y ~/src/commons/bican/sky-demo.yaml
Task from YAML spec: /Users/hanqingliu/src/commons/bican/sky-demo.yaml
No cloud is enabled. SkyPilot will not be able to run any task. Run `sky check` for more info.
(base)
# hanqingliu @ Hanqings-iMac in ~ [9:50:42]
$ sky --version
skypilot, version 1.0.0.dev20231215
(base)
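
Since running `gcloud auth application-default login` resolved the problem, it can be useful to check whether an Application Default Credentials (ADC) file is present before launching. The sketch below assumes the default Linux/macOS location, `~/.config/gcloud/application_default_credentials.json`, and the `GOOGLE_APPLICATION_CREDENTIALS` override. `find_adc_file` is a hypothetical helper, and it only looks for the file; it does not verify any of the permissions that `sky check` listed.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_adc_file(env: Mapping[str, str], home: Path) -> Optional[Path]:
    """Return the ADC file that would be picked up, or None if none exists."""
    # An explicit override takes precedence over the well-known location.
    explicit = env.get("GOOGLE_APPLICATION_CREDENTIALS")
    if explicit:
        path = Path(explicit)
        return path if path.is_file() else None
    # Default location written by `gcloud auth application-default login`
    # on Linux/macOS (assumes the gcloud config directory was not relocated).
    well_known = home / ".config" / "gcloud" / "application_default_credentials.json"
    return well_known if well_known.is_file() else None


adc = find_adc_file(os.environ, Path.home())
print(adc if adc else "No ADC file found; try: gcloud auth application-default login")
```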