skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[GCP] Uncaught `CONDITION_NOT_MET` error #1843

Closed Michaelvll closed 1 year ago

Michaelvll commented 1 year ago

When launching the spot controller, the following error occurs:

W 04-08 10:29:39 cloud_vm_ray_backend.py:621] Got return code CONDITION_NOT_MET in us-central1-a (message: Labels fingerprint either invalid or resource labels have changed)

Traceback (most recent call last):
  File "/home/username/miniconda3/envs/sky-dev/bin/sky", line 33, in <module>
    sys.exit(load_entry_point('skypilot', 'console_scripts', 'sky')())
  File "/home/username/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1128, in __call__
    return self.main(*args, **kwargs)
  File "/home/username/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1053, in main
    rv = self.invoke(ctx)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/cli.py", line 1059, in invoke
    return super().invoke(ctx)
  File "/home/username/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/cli.py", line 1059, in invoke
    return super().invoke(ctx)
  File "/home/username/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1659, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/home/username/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 1395, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/home/username/miniconda3/envs/sky-dev/lib/python3.8/site-packages/click/core.py", line 754, in invoke
    return __callback(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/cli.py", line 3351, in spot_launch
    sky.spot_launch(task,
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/execution.py", line 671, in spot_launch
    _execute(
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/execution.py", line 266, in _execute
    handle = backend.provision(task,
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 220, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/backends/backend.py", line 56, in provision
    return self._provision(task, to_provision, dryrun, stream_logs,
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/backends/cloud_vm_ray_backend.py", line 2221, in _provision
    config_dict = provisioner.provision_with_retries(
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/utils/common_utils.py", line 241, in _record
    return f(*args, **kwargs)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/backends/cloud_vm_ray_backend.py", line 1719, in provision_with_retries
    config_dict = self._retry_zones(
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/backends/cloud_vm_ray_backend.py", line 1310, in _retry_zones
    definitely_no_nodes_launched = self._update_blocklist_on_error(
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/backends/cloud_vm_ray_backend.py", line 909, in _update_blocklist_on_error
    handler(launchable_resources, region, zones, stdout, stderr)
  File "/home/username/Library/CloudStorage/OneDrive-Personal/AResource/PhD/Research/sky-computing/code/sky-experiment-dev/sky/backends/cloud_vm_ray_backend.py", line 661, in _update_blocklist_on_gcp_error
    assert False, error
AssertionError: {'code': 'CONDITION_NOT_MET', 'message': 'Labels fingerprint either invalid or resource labels have changed'}

After running `sky spot launch` again, the problem appeared to resolve itself without any other changes.
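Since the traceback ends in `assert False, error` and a plain retry succeeded, one possible direction is to classify this error code as transient instead of asserting. The following is only a minimal sketch under that assumption, not SkyPilot's actual handler or fix: `TRANSIENT_GCP_ERROR_CODES` and `classify_gcp_error` are hypothetical names, and only the error dict itself is taken from the log above.

```python
from typing import Dict, Set

# Hypothetical set of GCP error codes treated as transient (retryable).
# Only CONDITION_NOT_MET comes from the log in this issue.
TRANSIENT_GCP_ERROR_CODES: Set[str] = {'CONDITION_NOT_MET'}


def classify_gcp_error(error: Dict[str, str]) -> str:
    """Return 'retry' for known transient codes, 'unknown' otherwise.

    A caller could retry the same zone on 'retry' and fall back to its
    existing handling (for example, raising) on 'unknown'.
    """
    code = error.get('code', '')
    if code in TRANSIENT_GCP_ERROR_CODES:
        return 'retry'
    return 'unknown'


# The error dict exactly as it appears in the AssertionError above.
error = {
    'code': 'CONDITION_NOT_MET',
    'message': ('Labels fingerprint either invalid or resource labels '
                'have changed'),
}
action = classify_gcp_error(error)
```

Whether retrying in the same zone is the right response to a stale labels fingerprint is itself an assumption here; it is suggested only by the observation that relaunching worked.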

github-actions[bot] commented 1 year ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 1 year ago

This issue was closed because it has been stalled for 10 days with no activity.