Open romilbhardwaj opened 1 year ago
This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.
This issue was closed because it has been stalled for 10 days with no activity.
This is still an important issue to fix, especially when the remote k8s API server has high latency.
Whenever a resource object is copied, we end up calling `canonicalize_accelerator_name`: https://github.com/skypilot-org/skypilot/blob/a3311f691776b8eff3a9d22a5a06e4d7959fd201/sky/utils/accelerator_registry.py#L43

If the accelerator name is not in the registered `_ACCELERATORS`, it ends up calling `service_catalog.list_accelerators` to check if the user is using any custom accelerators. This can be an expensive call, especially in the case of Kubernetes, where it makes a `list_node` API call. Since `resource.copy()` is invoked many times in our optimizer, it can significantly increase the time taken to optimize.

For instance, before #2724, running `sky launch --gpus L4:1` with a Kubernetes cluster would take multiple minutes. See py-spy logs.

Two action items:
(Thanks to @Michaelvll for catching this)
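
To illustrate the cost pattern described in this issue, here is a minimal, self-contained sketch. It is not SkyPilot's actual code: `_KNOWN_ACCELERATORS`, `list_custom_accelerators`, `canonicalize_uncached`, `canonicalize_cached`, and `simulate_copies` are hypothetical stand-ins, and the slow catalog lookup is replaced by a counter. It shows how an unregistered name triggers the expensive lookup on every copy, and how memoizing the result (one possible mitigation, assumed here for illustration) would reduce that to a single lookup.

```python
import functools

# Hypothetical stand-ins; these are NOT SkyPilot's real registry or catalog API.
_KNOWN_ACCELERATORS = ['A100', 'V100', 'T4']
expensive_calls = 0  # counts how often the "slow" lookup runs


def list_custom_accelerators():
    """Stand-in for a slow catalog lookup (e.g. a remote API round trip)."""
    global expensive_calls
    expensive_calls += 1
    return ['L4']


def canonicalize_uncached(name):
    """Return the canonical spelling, falling back to the slow lookup."""
    for known in _KNOWN_ACCELERATORS:
        if name.lower() == known.lower():
            return known
    # Name is not registered: pay for the expensive lookup every time.
    for custom in list_custom_accelerators():
        if name.lower() == custom.lower():
            return custom
    return name


@functools.lru_cache(maxsize=None)
def canonicalize_cached(name):
    """Same logic, but each distinct name is resolved at most once."""
    return canonicalize_uncached(name)


def simulate_copies(canonicalize, name, copies):
    """Mimic an optimizer that re-canonicalizes the name on every copy."""
    global expensive_calls
    expensive_calls = 0
    for _ in range(copies):
        canonicalize(name)
    return expensive_calls


if __name__ == '__main__':
    print('uncached lookups:', simulate_copies(canonicalize_uncached, 'l4', 100))
    print('cached lookups:  ', simulate_copies(canonicalize_cached, 'l4', 100))
```

Note that a cache like this trades freshness for speed: if the set of accelerators available in the cluster changes, cached results could be stale until the cache is cleared.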