skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[k8s] Add flag to disable ssh setup #4261

Closed romilbhardwaj closed 2 weeks ago

romilbhardwaj commented 2 weeks ago

Requested by user. Adds a flag to disable SSH setup, which can take ~10s per pod.

Depending on the degree of parallelism available in an environment and the total number of pods to provision, this could add delays of anywhere between 10 seconds and n*10 seconds.
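To make the 10 s to n*10 s range concrete, here is a rough sketch of the arithmetic. It assumes pods are set up in fixed-size waves and that each wave costs about 10 s. The function name `ssh_setup_delay` and the wave model are illustrative assumptions, not SkyPilot code.

```python
import math

# Approximate per-pod SSH setup cost quoted in this PR description.
SSH_SETUP_SECONDS = 10.0


def ssh_setup_delay(num_pods: int, parallelism: int,
                    per_pod_seconds: float = SSH_SETUP_SECONDS) -> float:
    """Estimate the total SSH setup delay in seconds.

    Assumption (not from SkyPilot): pods are set up in waves of
    `parallelism` pods, and every wave takes `per_pod_seconds`.
    """
    if num_pods < 0 or parallelism < 1:
        raise ValueError("num_pods must be >= 0 and parallelism >= 1")
    waves = math.ceil(num_pods / parallelism)
    return waves * per_pod_seconds


# Fully parallel: one wave, so the best case is ~10 s.
best_case = ssh_setup_delay(num_pods=100, parallelism=100)
# Fully serial: n waves, so the worst case is n * 10 s.
worst_case = ssh_setup_delay(num_pods=100, parallelism=1)
# Something in between, e.g. 16 pods at a time (7 waves).
partial = ssh_setup_delay(num_pods=100, parallelism=16)
```

Under this model the real delay for a 100-pod launch falls somewhere between `best_case` and `worst_case`, depending on how much parallelism the environment actually provides.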

Tested (run the relevant ones):

num_nodes: 100

experimental:
  config_overrides:
    kubernetes:
      disable_ssh: true

romilbhardwaj commented 2 weeks ago

Added conditional checks for apt update and apt install to further reduce provisioning time.

Breakdown of SSH provisioning time

Master: Total 13s to set up SSH

This branch: Total 1s to set up SSH

Comparing with master and different values of disable_ssh:

Tested with a single node: sky launch -y -c test --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01

Master: ~2 min 2s

Runs
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 5.73s user 3.97s system 7% cpu 2:02.55 total
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 6.31s user 4.12s system 7% cpu 2:04.55 total
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 5.68s user 3.94s system 7% cpu 2:03.55 total

This branch, with SSH enabled (disable_ssh not set): ~1 min 56s

Runs
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 4.84s user 3.78s system 7% cpu 1:58.31 total
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 5.41s user 3.71s system 7% cpu 1:56.03 total
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 5.55s user 3.74s system 8% cpu 1:55.17 total

This branch, with SSH disabled: ~1 min 52s

Runs
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 4.71s user 3.72s system 7% cpu 1:51.74 total
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 5.76s user 3.82s system 8% cpu 1:54.76 total
sky launch -y -c test --cloud kubernetes --image-id task4.yaml 5.13s user 3.72s system 7% cpu 1:52.71 total

Given that single-node provisioning time is reduced by ~4s when SSH is disabled, and by ~10s compared to master (with larger improvements for multi-node), it might be worthwhile to keep the disable_ssh flag.
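As a sanity check on the ~4s and ~10s figures, the wall-clock totals from the three sets of runs can be averaged. The sketch below is only an illustration: the helper `parse_total` and the dictionary labels are made up for this example, while the timing strings are copied from the `time` output above.

```python
from statistics import mean


def parse_total(total: str) -> float:
    """Convert a `time`-style total such as '2:02.55' into seconds."""
    minutes, seconds = total.split(":")
    return int(minutes) * 60 + float(seconds)


# Wall-clock totals copied from the runs above; labels are illustrative.
RUNS = {
    "master": ["2:02.55", "2:04.55", "2:03.55"],
    "branch_ssh_enabled": ["1:58.31", "1:56.03", "1:55.17"],
    "branch_ssh_disabled": ["1:51.74", "1:54.76", "1:52.71"],
}

averages = {
    name: mean(parse_total(t) for t in totals)
    for name, totals in RUNS.items()
}

# Seconds saved by disabling SSH on this branch, and relative to master.
saved_vs_enabled = averages["branch_ssh_enabled"] - averages["branch_ssh_disabled"]
saved_vs_master = averages["master"] - averages["branch_ssh_disabled"]

for name, avg in averages.items():
    print(f"{name}: {avg:.2f}s")
print(f"saved vs SSH enabled: {saved_vs_enabled:.2f}s")
print(f"saved vs master: {saved_vs_master:.2f}s")
```

With only three runs per configuration and a spread of a few seconds between runs, these averages are indicative rather than statistically firm.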

romilbhardwaj commented 2 weeks ago

Identified that apt update is the slowest operation. Refactored to run apt update in the container init args, so that it runs in parallel from the start.

Testing on nemo image: sky launch -y -c test --num-nodes 100 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01

Master branch: 19:56.21 total

This branch, SSH enabled: 15:26.56 total

This branch, SSH disabled: 14:47.28 total

Given that waiting for SSH adds only about 40 seconds of overhead on 100 nodes, it might not be worthwhile to disable SSH (or to keep a flag for disabling it). I'll close this PR for now and open a new PR with the optimizations.
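For reference, the 100-node totals can be put side by side to show how small the SSH wait is next to the gain from the apt changes. This is a sketch only: the helper `to_seconds` and the variable names are illustrative, and the totals are the ones reported above.

```python
def to_seconds(total: str) -> float:
    """Convert a 'MM:SS.ss' total into seconds."""
    minutes, seconds = total.split(":")
    return int(minutes) * 60 + float(seconds)


# 100-node totals reported above.
master = to_seconds("19:56.21")
branch_ssh_enabled = to_seconds("15:26.56")
branch_ssh_disabled = to_seconds("14:47.28")

# Extra time spent waiting for SSH on this branch (about 39 s).
ssh_overhead = branch_ssh_enabled - branch_ssh_disabled
# Time saved by the branch even with SSH left enabled (about 270 s).
branch_gain = master - branch_ssh_enabled

# The same numbers as fractions of the run they are compared against.
ssh_overhead_fraction = ssh_overhead / branch_ssh_enabled  # roughly 4%
branch_gain_fraction = branch_gain / master                # roughly 23%

print(f"SSH overhead: {ssh_overhead:.2f}s ({ssh_overhead_fraction:.1%})")
print(f"Gain over master: {branch_gain:.2f}s ({branch_gain_fraction:.1%})")
```

These are single runs per configuration, so the percentages are a rough guide rather than a precise measurement.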