skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.86k stars 518 forks source link

[core][k8s][perf] `service_catalog.list_accelerators` is called on every `resource.copy()` #2725

Open romilbhardwaj opened 1 year ago

romilbhardwaj commented 1 year ago

Whenever a resource object is copied, we end up calling canonicalize_accelerator_name:

https://github.com/skypilot-org/skypilot/blob/a3311f691776b8eff3a9d22a5a06e4d7959fd201/sky/utils/accelerator_registry.py#L43

If the accelerator name is not in registered _ACCELERATORS, it ends up calling service_catalog.list_accelerators to check if the user is using any custom accelerators. This can be an expensive call, especially in the case of Kubernetes where it makes a list_node API call. Since resource.copy() is invoked many times in our optimizer, it can significantly increase the time taken to optimize.

For instance, before #2724, running sky launch --gpus L4:1 with a Kubernetes cluster would take multiple minutes. See py-spy logs.

Two action items:

  1. In our Kubernetes implementation, we should look into TTL-caching our calls to list_node (and potentially other read operations).
  2. More generally, we should see if it's possible to reuse the canonical accelerator name when resource.copy() is invoked.

(Thanks to @Michaelvll for catching this)

github-actions[bot] commented 9 months ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 1 month ago

This issue is stale because it has been open 120 days with no activity. Remove stale label or comment or this will be closed in 10 days.

github-actions[bot] commented 3 weeks ago

This issue was closed because it has been stalled for 10 days with no activity.

romilbhardwaj commented 3 weeks ago

This is still an important issue to fix, esp when the remote k8s API server has a high latency.