skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] Ambiguity when GPU labels overlap with an existing accelerator #3562

Open romilbhardwaj opened 4 months ago

romilbhardwaj commented 4 months ago

A user reported the following error on a cluster whose GPU nodes were manually labeled as NVIDIA-RTX-A6000:

SKYPILOT_DEBUG=1 sky launch --cloud kubernetes --gpus a6000:1 ./mistral.yaml
D 05-17 10:20:50 skypilot_config.py:136] Using config path: /home/amgmt/.sky/config.yaml
D 05-17 10:20:50 skypilot_config.py:140] Config loaded:
D 05-17 10:20:50 skypilot_config.py:140] {'kubernetes': {'ports': 'loadbalancer'}}
D 05-17 10:20:50 skypilot_config.py:150] Config syntax check passed.
Task from YAML spec: ./mistral.yaml
Traceback (most recent call last):
  File "/usr/local/bin/sky", line 8, in <module>
    sys.exit(cli())
  File "/usr/lib/python3/dist-packages/click/core.py", line 764, in __call__
    return self.main(*args, **kwargs)
  File "/usr/lib/python3/dist-packages/click/core.py", line 717, in main
    rv = self.invoke(ctx)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 350, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 1198, in invoke
    return super().invoke(ctx)
  File "/usr/lib/python3/dist-packages/click/core.py", line 1137, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/usr/lib/python3/dist-packages/click/core.py", line 956, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/usr/lib/python3/dist-packages/click/core.py", line 555, in invoke
    return callback(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/common_utils.py", line 371, in _record
    return f(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 1421, in launch
    task_or_dag = _make_task_or_dag_from_entrypoint_with_overrides(
  File "/usr/local/lib/python3.8/dist-packages/sky/cli.py", line 1177, in _make_task_or_dag_from_entrypoint_with_overrides
    task.set_resources_override(override_params)
  File "/usr/local/lib/python3.8/dist-packages/sky/task.py", line 626, in set_resources_override
    new_resources = res.copy(**override_params)
  File "/usr/local/lib/python3.8/dist-packages/sky/resources.py", line 1053, in copy
    resources = Resources(
  File "/usr/local/lib/python3.8/dist-packages/sky/resources.py", line 201, in __init__
    self._set_accelerators(accelerators, accelerator_args)
  File "/usr/local/lib/python3.8/dist-packages/sky/resources.py", line 506, in _set_accelerators
    accelerators = {
  File "/usr/local/lib/python3.8/dist-packages/sky/resources.py", line 507, in <dictcomp>
    accelerator_registry.canonicalize_accelerator_name(
  File "/usr/local/lib/python3.8/dist-packages/sky/utils/accelerator_registry.py", line 117, in canonicalize_accelerator_name
    raise ValueError(f'Accelerator name {accelerator!r} is ambiguous. '
ValueError: Accelerator name 'a6000' is ambiguous. Please choose one of ['A6000', 'NVIDIA-RTX-A6000'].

We should probably detect this case and, when the overlapping name comes from a manually applied GPU label, hint to the user that the nodes should be relabeled with canonical accelerator names.
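One possible shape for that detection, as a rough sketch only: it is not SkyPilot's actual code, and the helper names `matching_names` and `ambiguity_hint`, the case-insensitive substring matching, and the hint wording are all assumptions made for illustration. The idea is to separate the candidates that are canonical names from those that exist only as cluster labels, and to produce a hint only when both groups match the requested name.

```python
from typing import Iterable, List, Optional


def matching_names(requested: str, names: Iterable[str]) -> List[str]:
    """Return the names that contain `requested`, ignoring case (assumed rule)."""
    needle = requested.lower()
    return sorted({name for name in names if needle in name.lower()})


def ambiguity_hint(requested: str, canonical_names: Iterable[str],
                   cluster_label_names: Iterable[str]) -> Optional[str]:
    """Return a hint if a cluster label overlaps with a canonical name.

    Hypothetical helper: `canonical_names` stands in for the known
    accelerator names, `cluster_label_names` for the GPU names read from
    node labels on the Kubernetes cluster.
    """
    canonical_all = set(canonical_names)
    canonical = matching_names(requested, canonical_all)
    # Labels that match the request but are not canonical names themselves.
    custom = [name for name in matching_names(requested, cluster_label_names)
              if name not in canonical_all]
    if not canonical or not custom:
        # No overlap between canonical names and manual labels.
        return None
    return (f'Accelerator name {requested!r} matches canonical name(s) '
            f'{canonical} and cluster label(s) {custom}. If the nodes were '
            f'labeled manually, consider relabeling them with the canonical '
            f'name(s) {canonical}.')


if __name__ == '__main__':
    print(ambiguity_hint('a6000', ['A6000', 'A100'], ['NVIDIA-RTX-A6000']))
```

A caller could append this hint to the existing `ValueError` message instead of replacing it, so the list of ambiguous choices stays visible.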

github-actions[bot] commented 4 weeks ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 2 weeks ago

This issue was closed because it has been stalled for 10 days with no activity.