skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks

[k8s] Improve multi-node provisioning time (nimbus) #4393

romilbhardwaj closed 10 hours ago

romilbhardwaj commented 16 hours ago

This PR introduces a bunch of optimizations for large-scale k8s provisioning.

Together they deliver a >4x speedup when provisioning hundreds of nodes.

SKYPILOT_TIMELINE_FILE_PATH='timeline_100.prof' sky launch -y -c test --cloud kubernetes --num-nodes 100 --cpus 1 -- echo hi

Similar times with a NeMo-derived image, optimized using the instructions in this PR.
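To compare runs like the one above, it can help to total the time spent per traced step. The following is a minimal sketch, not part of this PR. It assumes the file written via SKYPILOT_TIMELINE_FILE_PATH is JSON in Chrome trace event format (a `traceEvents` list with `B`/`E` phase events and microsecond `ts` values). The helper names `load_trace` and `summarize_trace` are hypothetical.

```python
import json
from collections import defaultdict


def load_trace(path: str) -> dict:
    """Load a trace file (assumed to be Chrome trace event JSON)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def summarize_trace(trace: dict) -> dict:
    """Return total seconds per event name, pairing begin/end events.

    Assumption: each event has "name", "ph" ("B" or "E") and "ts"
    (microseconds, possibly stored as a string).
    """
    # Keep only duration begin/end events and order them by timestamp.
    events = [e for e in trace.get("traceEvents", [])
              if e.get("ph") in ("B", "E") and "ts" in e]
    events.sort(key=lambda e: float(e["ts"]))

    open_starts = defaultdict(list)  # (pid, tid, name) -> stack of start times
    totals = defaultdict(float)      # name -> accumulated seconds
    for ev in events:
        key = (ev.get("pid"), ev.get("tid"), ev["name"])
        ts = float(ev["ts"])
        if ev["ph"] == "B":
            open_starts[key].append(ts)
        elif open_starts[key]:
            # Match this end event with the most recent unmatched begin.
            start = open_starts[key].pop()
            totals[ev["name"]] += (ts - start) / 1e6
    return dict(totals)


# Example usage (the path matches the command above):
#   totals = summarize_trace(load_trace("timeline_100.prof"))
#   for name, secs in sorted(totals.items(), key=lambda kv: -kv[1])[:10]:
#       print(f"{secs:8.2f}s  {name}")
```

Sorting by total time puts the slowest steps first, which is usually where a multi-node launch loses time. Nested events are counted in full under each name, so totals can overlap.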

Tested (run the relevant ones):

romilbhardwaj commented 13 hours ago

Starting to run some final tests. @cg505 @Michaelvll if you find some time please do a quick round of reviews. Thanks!

romilbhardwaj commented 13 hours ago

Should add uv to our base images. Takes ~2s to install it otherwise.

cg505 commented 13 hours ago

Should add uv to our base images. Takes ~2s to install it otherwise.

working on this now

romilbhardwaj commented 12 hours ago

Running smoke tests:

cg505 commented 11 hours ago

Ran backwards compatibility tests, no issues.

romilbhardwaj commented 11 hours ago

Smoke tests for aws and k8s pass (barring a few unrelated failures). Merging now. Thanks for the great work @cg505 and @Michaelvll!