skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Serve] Temporary failure: infinite retry on GCP `compute.images.useReadOnly` permission error #4329

Open andylizf opened 1 week ago

andylizf commented 1 week ago

When running `sky serve up examples/serve/http_server/task.yaml -n new-http --cloud gcp`, the command keeps retrying because of a GCP permission error:

Required 'compute.images.useReadOnly' permission for 'projects/sky-dev-465/global/images/skypilot-gcp-cpu-ubuntu-20241017184242'

The command retries indefinitely. This is likely a temporary issue with GCP permissions.
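
One possible way to bound the retries is sketched below. This is only an illustration, not SkyPilot's actual provisioning code: `provision_with_retry`, `is_retryable`, `NonRetryableProvisionError`, and the marker strings are all hypothetical names chosen for this example. The idea is to stop immediately when the error message looks like a permission failure (such as the `forbidden` return code in the logs below) and to cap the number of attempts for everything else.

```python
import time


class NonRetryableProvisionError(Exception):
    """Hypothetical error type for failures that retrying cannot fix."""


# Assumption: these substrings are enough to spot a permission failure
# in an error message. A real implementation would inspect the HTTP status.
NON_RETRYABLE_MARKERS = ("forbidden", "permission")


def is_retryable(error_message: str) -> bool:
    """Return False when the message looks like a permission failure."""
    lowered = error_message.lower()
    return not any(marker in lowered for marker in NON_RETRYABLE_MARKERS)


def provision_with_retry(launch, max_attempts=3, backoff_seconds=1.0,
                         sleep=time.sleep):
    """Call launch() until it succeeds, hits a non-retryable error,
    or max_attempts is reached."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return launch()
        except RuntimeError as e:
            if not is_retryable(str(e)):
                # Fail fast: retrying will not grant the missing permission.
                raise NonRetryableProvisionError(str(e)) from e
            last_error = e
            if attempt < max_attempts:
                # Exponential backoff between attempts.
                sleep(backoff_seconds * 2 ** (attempt - 1))
    raise RuntimeError(
        f"Gave up after {max_attempts} attempts") from last_error
```

With a policy like this, the `compute.images.useReadOnly` error would surface after the first attempt instead of looping. If the permission problem really is transient, the marker list could be relaxed so that the attempt cap alone limits the retries.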

Partial Logs
D 11-10 21:02:03 provisioner.py:135] SkyPilot version: 1.0.0-dev0; commit: 1f25cd36cd76e7f3380f2cb80d0c33a1cf632f94
D 11-10 21:02:03 provisioner.py:137] 
D 11-10 21:02:03 provisioner.py:137] 
D 11-10 21:02:03 provisioner.py:137] ==================== Provisioning ====================
D 11-10 21:02:03 provisioner.py:137] 
D 11-10 21:02:03 provisioner.py:138] Provision config:
D 11-10 21:02:03 provisioner.py:138] {
D 11-10 21:02:03 provisioner.py:138]   "provider_config": {
D 11-10 21:02:03 provisioner.py:138]     "type": "external",
D 11-10 21:02:03 provisioner.py:138]     "module": "sky.provision.gcp",
D 11-10 21:02:03 provisioner.py:138]     "region": "us-central1",
D 11-10 21:02:03 provisioner.py:138]     "availability_zone": "us-central1-a",
D 11-10 21:02:03 provisioner.py:138]     "cache_stopped_nodes": true,
D 11-10 21:02:03 provisioner.py:138]     "project_id": "psychic-order-437203-r7",
D 11-10 21:02:03 provisioner.py:138]     "firewall_rule": "sky-ports-sky-serve-controller-6eabc0cb-6eab",
D 11-10 21:02:03 provisioner.py:138]     "use_internal_ips": false,
D 11-10 21:02:03 provisioner.py:138]     "force_enable_external_ips": false,
D 11-10 21:02:03 provisioner.py:138]     "disable_launch_config_check": true,
D 11-10 21:02:03 provisioner.py:138]     "use_managed_instance_group": false
D 11-10 21:02:03 provisioner.py:138]   },
D 11-10 21:02:03 provisioner.py:138]   "authentication_config": {
D 11-10 21:02:03 provisioner.py:138]     "ssh_user": "gcpuser",
D 11-10 21:02:03 provisioner.py:138]     "ssh_private_key": "~/.ssh/sky-key"
D 11-10 21:02:03 provisioner.py:138]   },
D 11-10 21:02:03 provisioner.py:138]   "docker_config": {},
D 11-10 21:02:03 provisioner.py:138]   "node_config": {
D 11-10 21:02:03 provisioner.py:138]     "labels": {
D 11-10 21:02:03 provisioner.py:138]       "skypilot-user": "andyl",
D 11-10 21:02:03 provisioner.py:138]       "use-managed-instance-group": "0"
D 11-10 21:02:03 provisioner.py:138]     },
D 11-10 21:02:03 provisioner.py:138]     "machineType": "n2-standard-4",
D 11-10 21:02:03 provisioner.py:138]     "disks": [
D 11-10 21:02:03 provisioner.py:138]       {
D 11-10 21:02:03 provisioner.py:138]         "boot": true,
D 11-10 21:02:03 provisioner.py:138]         "autoDelete": true,
D 11-10 21:02:03 provisioner.py:138]         "type": "PERSISTENT",
D 11-10 21:02:03 provisioner.py:138]         "initializeParams": {
D 11-10 21:02:03 provisioner.py:138]           "diskSizeGb": 200,
D 11-10 21:02:03 provisioner.py:138]           "sourceImage": "projects/sky-dev-465/global/images/skypilot-gcp-cpu-ubuntu-20241017184242",
D 11-10 21:02:03 provisioner.py:138]           "diskType": "zones/us-central1-a/diskTypes/pd-balanced"
D 11-10 21:02:03 provisioner.py:138]         }
D 11-10 21:02:03 provisioner.py:138]       }
D 11-10 21:02:03 provisioner.py:138]     ],
D 11-10 21:02:03 provisioner.py:138]     "metadata": {
D 11-10 21:02:03 provisioner.py:138]       "items": [
D 11-10 21:02:03 provisioner.py:138]         {
D 11-10 21:02:03 provisioner.py:138]           "key": "ssh-keys",
D 11-10 21:02:03 provisioner.py:138]           "value": "gcpuser:ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAABAQC6nCRL1m/qbnjCm/9uF91bsQXtyMmDB/JCvHBs19bKJfTa5N/lW+SetSqKox+63QIuH2hfK9x7cs5a4BWLDmGFfXg/PobmcY31jv6hlM8oaXwulJqQnW7oww0SdjlFrJ5XjMtm2eFAZ5r85NGPgEI8PcvwzUqGkPqhsrYYY7hMG5A/WfSFMSkZGoRMjkxo+mpHSV08SyzI/xO7kTYuA7GUs9VbrErODptxiSWiisD39MUUiAKtU7kCVRKw4iE8KnWb0vwiZN4Skkg9yDMf9sr8iAQmR2y9RvyY3JtxmgGosTMWGZ0E5oosyLEUbHXsa++u2alAhKDqfn3jXAaCfEUd"
D 11-10 21:02:03 provisioner.py:138]         }
D 11-10 21:02:03 provisioner.py:138]       ]
D 11-10 21:02:03 provisioner.py:138]     }
D 11-10 21:02:03 provisioner.py:138]   },
D 11-10 21:02:03 provisioner.py:138]   "count": 1,
D 11-10 21:02:03 provisioner.py:138]   "tags": {},
D 11-10 21:02:03 provisioner.py:138]   "resume_stopped_nodes": true,
D 11-10 21:02:03 provisioner.py:138]   "ports_to_open_on_launch": null
D 11-10 21:02:03 provisioner.py:138] }
D 11-10 21:02:03 config.py:117] gcp_credentials not found in cluster yaml file. Falling back to GOOGLE_APPLICATION_CREDENTIALS environment variable.
I 11-10 21:02:06 config.py:217] _configure_iam_role: Checking permissions for skypilot-v1@psychic-order-437203-r7.iam.gserviceaccount.com...
I 11-10 21:02:07 config.py:613] get_usable_vpc: Found a usable VPC network 'default'.
I 11-10 21:02:09 instance.py:212] []
D 11-10 21:02:09 instance_utils.py:802] Launching GCP instances with "bulkInsert" ...
D 11-10 21:02:10 instance_utils.py:851] create_instances: googleapiclient.errors.HttpError: 
W 11-10 21:02:10 instance_utils.py:112] Got return code 'forbidden' in us-central1-a: "Required 'compute.images.useReadOnly' permission for 'projects/sky-dev-465/global/images/skypilot-gcp-cpu-ubuntu-20241017184242'"
D 11-10 21:02:10 provisioner.py:150] Failed to provision 'sky-serve-controller-6eabc0cb' on GCP (us-central1-a).
D 11-10 21:02:10 provisioner.py:152] bulk_provision for 'sky-serve-controller-6eabc0cb' failed. Stacktrace:
D 11-10 21:02:10 provisioner.py:152] Traceback (most recent call last):
D 11-10 21:02:10 provisioner.py:152]   File "/home/andyl/skypilot/sky/provision/provisioner.py", line 141, in bulk_provision
D 11-10 21:02:10 provisioner.py:152]     return _bulk_provision(cloud, region, cluster_name,
D 11-10 21:02:10 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 11-10 21:02:10 provisioner.py:152]   File "/home/andyl/skypilot/sky/provision/provisioner.py", line 63, in _bulk_provision
D 11-10 21:02:10 provisioner.py:152]     provision_record = provision.run_instances(provider_name,
D 11-10 21:02:10 provisioner.py:152]                        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 11-10 21:02:10 provisioner.py:152]   File "/home/andyl/skypilot/sky/provision/__init__.py", line 50, in _wrapper
D 11-10 21:02:10 provisioner.py:152]     return impl(*args, **kwargs)
D 11-10 21:02:10 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^
D 11-10 21:02:10 provisioner.py:152]   File "/home/andyl/skypilot/sky/provision/gcp/instance.py", line 360, in run_instances
D 11-10 21:02:10 provisioner.py:152]     return _run_instances(region, cluster_name_on_cloud, config)
D 11-10 21:02:10 provisioner.py:152]            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
D 11-10 21:02:10 provisioner.py:152]   File "/home/andyl/skypilot/sky/provision/gcp/instance.py", line 301, in _run_instances
D 11-10 21:02:10 provisioner.py:152]     raise error
D 11-10 21:02:10 provisioner.py:152] sky.provision.common.ProvisionerError: Failed to launch instances.
D 11-10 21:02:10 provisioner.py:152] 
D 11-10 21:02:10 provisioner.py:157] Stopping the failed cluster.
D 11-10 21:02:10 instance.py:36] handlers: []
D 11-10 21:02:11 instance.py:47] handler_to_instances: defaultdict(, {})
D 11-10 21:02:11 instance.py:36] handlers: dict_keys([])
D 11-10 21:02:11 instance.py:47] handler_to_instances: defaultdict(, {})
D 11-10 21:02:51 provisioner.py:135] SkyPilot version: 1.0.0-dev0; commit: 1f25cd36cd76e7f3380f2cb80d0c33a1cf632f94
D 11-10 21:02:51 provisioner.py:137] 
D 11-10 21:02:51 provisioner.py:137]