skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.81k stars 513 forks source link

[k8s] Parallelize multi-node setup #4297

Closed romilbhardwaj closed 1 week ago

romilbhardwaj commented 2 weeks ago

Supersedes #4261 and #4270.

We identified that apt update is the slowest operation and triggering it after setup takes a lot of time, so we now inline it in container init args to let kubernetes run it in parallel.

Testing on nemo image

sky launch -y -c test --num-nodes 100 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01

Master branch: 19:56.21 total

This branch: 15:26.56 total

Testing on default image:

sky launch -y -c test --num-nodes 100 --cloud kubernetes

This branch: 4:13.41 total

Tested (run the relevant ones):