skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

GCP per minute request limit quota #586

Closed Michaelvll closed 2 years ago

Michaelvll commented 2 years ago

While debugging the launch of a multi-node GCP cluster, I encountered the following error, which suggests that GCP enforces a per-minute quota on list requests.

Traceback (most recent call last):
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/ray/autoscaler/_private/updater.py", line 166, in run
    self.provider.set_node_tags(self.node_id, tags_to_set)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 41, in method_with_retries
    return method(self, *args, **kwargs)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 126, in set_node_tags
    node = self._get_node(node_id)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 41, in method_with_retries
    return method(self, *args, **kwargs)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 215, in _get_node
    self.non_terminated_nodes({})  # Side effect: updates cache
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 41, in method_with_retries
    return method(self, *args, **kwargs)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node_provider.py", line 100, in non_terminated_nodes
    node_instances = resource.list_instances(tag_filters)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node.py", line 332, in list_instances
    return self._list_instances(label_filters, non_terminated_status)
  File "/Users/zhwu/OneDrive/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/skylet/providers/gcp/node.py", line 366, in _list_instances
    response = self.resource.instances().list(
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/_helpers.py", line 131, in positional_wrapper
    return wrapped(*args, **kwargs)
  File "/Users/zhwu/miniconda3/envs/sky-dev/lib/python3.8/site-packages/googleapiclient/http.py", line 937, in execute
    raise HttpError(resp, content, uri=self.uri)
googleapiclient.errors.HttpError: <HttpError 403 when requesting https://compute.googleapis.com/compute/v1/projects/intercloud-320520/zones/us-west1-a/instances?filter=%28%28status+%3D+RUNNING%29+OR+%28status+%3D+STAGING%29+OR+%28status+%3D+PROVISIONING%29%29+AND+%28labels.ray-cluster-name+%3D+gcp-terminate-test%29&alt=json returned "Quota exceeded for quota metric 'List requests' and limit 'List requests per minute' of service 'compute.googleapis.com' for consumer 'project_number:572505469125'.". Details: "[{'message': "Quota exceeded for quota metric 'List requests' and limit 'List requests per minute' of service 'compute.googleapis.com' for consumer 'project_number:572505469125'.", 'domain': 'usageLimits', 'reason': 'rateLimitExceeded'}]">
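
One possible mitigation for this kind of `rateLimitExceeded` error is to retry the list call with exponential backoff and jitter. The sketch below is only an illustration under stated assumptions, not the fix that went into the repo: `RateLimitError`, `retry_on_rate_limit`, and `list_instances` are hypothetical names, and `RateLimitError` stands in for the `HttpError` 403 shown in the traceback. The traceback shows calls already passing through `method_with_retries`; whether that wrapper backs off on this error is not shown here.

```python
import random
import time
from functools import wraps


class RateLimitError(Exception):
    """Hypothetical stand-in for an HttpError 403 with reason 'rateLimitExceeded'."""


def retry_on_rate_limit(max_attempts=5, base_delay=1.0, max_delay=60.0,
                        sleep=time.sleep):
    """Retry the wrapped call with exponential backoff and full jitter."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(max_attempts):
                try:
                    return func(*args, **kwargs)
                except RateLimitError:
                    if attempt == max_attempts - 1:
                        raise  # Out of attempts: surface the quota error.
                    # The delay ceiling doubles each attempt, capped at max_delay.
                    cap = min(max_delay, base_delay * 2 ** attempt)
                    # Full jitter spreads out retries from concurrent threads.
                    sleep(random.uniform(0, cap))
        return wrapper
    return decorator


# Demo: a fake list call that is rate limited twice before succeeding.
calls = {"count": 0}
delays = []  # Records requested sleeps instead of actually sleeping.


@retry_on_rate_limit(max_attempts=4, base_delay=1.0, sleep=delays.append)
def list_instances():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RateLimitError("Quota exceeded for 'List requests per minute'")
    return ["node-0", "node-1"]


result = list_instances()
```

The jitter matters here because the error was raised from an updater thread, and several threads retrying in lockstep would hit the per-minute limit again at the same moment. With the real client, the `except` clause would need to inspect the `HttpError` status and reason rather than catch a custom exception.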

concretevitamin commented 2 years ago

Can this be reliably reproduced?

concretevitamin commented 2 years ago

This should be fixed by #1191.