romilbhardwaj closed this 2 weeks ago
Added conditional checks for apt update and apt install to further reduce provisioning time.
Master: Total 13s to set up SSH
This branch: Total 1s to set up SSH
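The idea behind the conditional checks can be sketched roughly as follows. This is not the code in this branch: the `REQUIRED` mapping and the `build_setup_command` helper are hypothetical names used only for illustration, and the sketch assumes that a binary being present on `PATH` is a good enough signal that its package is already installed.

```python
import shutil
from typing import Callable, Dict, Optional

# Hypothetical mapping from a binary to the apt package that provides it.
# The real list of packages needed for SSH setup is not taken from the PR.
REQUIRED: Dict[str, str] = {
    "sshd": "openssh-server",
    "rsync": "rsync",
}


def build_setup_command(
    which: Callable[[str], Optional[str]] = shutil.which,
    required: Dict[str, str] = REQUIRED,
) -> str:
    """Return a shell command that installs only the packages that are missing."""
    missing = sorted(
        pkg for binary, pkg in required.items() if which(binary) is None
    )
    if not missing:
        # Nothing to install, so skip both apt update and apt install.
        return "true"
    return "apt-get update && apt-get install -y " + " ".join(missing)
```

On an image that already ships the required binaries, the returned command is a no-op, which is where the time saving would come from.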
disable_ssh: tested with a single node using sky launch -y -c test --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01
Master: ~2 min 2s
This branch, without disable SSH: ~1 min 56s
This branch, with disable SSH: ~1 min 52s
Given that single-node provisioning time is reduced by ~4s when SSH is disabled, and by ~10s compared to master (with larger improvements for multi-node), it might be worthwhile to keep the disable_ssh flag.
Identified that apt update is the slowest operation. Refactored to run apt update in the container init args, so it runs in parallel from the start.
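A rough sketch of what "run apt update in the container init args" could look like is below. It is an assumption-laden illustration, not the implementation in this branch: the marker file path, the log path, and both helper names are hypothetical. The point it shows is that the slow step starts in the background as soon as the container starts, while the main process stays in the foreground.

```python
from typing import Dict, List

# Hypothetical marker file; not a path taken from SkyPilot.
APT_DONE_MARKER = "/tmp/apt-update-done"


def container_entrypoint(main_command: str = "sleep infinity") -> Dict[str, List[str]]:
    """Build container command/args that start `apt-get update` in the background."""
    script = (
        # Run the slow step in a background subshell and touch a marker when it ends.
        f"(apt-get update > /tmp/apt-update.log 2>&1; touch {APT_DONE_MARKER}) & "
        # Replace the shell with the container's main process.
        f"exec {main_command}"
    )
    return {"command": ["/bin/bash", "-c"], "args": [script]}


def wait_for_apt_update(timeout_s: int = 300) -> str:
    """Shell snippet a later setup step could run before `apt-get install`."""
    return (
        f"for i in $(seq {timeout_s}); do "
        f"[ -f {APT_DONE_MARKER} ] && break; sleep 1; done"
    )
```

With this shape, any later step that needs the package index only has to wait for the marker instead of running apt update itself.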
Testing on nemo image: sky launch -y -c test --num-nodes 100 --cloud kubernetes --image-id nvcr.io/nvidia/nemo:24.05.01
Master branch: 19:56.21 total
This branch, SSH enabled: 15:26.56 total
This branch, SSH disabled: 14:47.28 total
Given that waiting for SSH adds only about 40 seconds of overhead on 100 nodes, it might not be worthwhile to disable SSH (or to keep the flag that disables it). I'll close this PR for now and open a new PR with the optimizations.
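For reference, the differences between the three 100-node runs can be checked with a few lines of Python. The sketch assumes the totals are in `MM:SS.ss` format, as the "total" suffix suggests; the helper name `to_seconds` is only illustrative.

```python
def to_seconds(stamp: str) -> float:
    """Convert a `MM:SS.ss` wall-clock string into seconds."""
    minutes, seconds = stamp.split(":")
    return int(minutes) * 60 + float(seconds)


# Totals reported above for the 100-node nemo image launch.
timings = {
    "master": to_seconds("19:56.21"),
    "ssh_enabled": to_seconds("15:26.56"),
    "ssh_disabled": to_seconds("14:47.28"),
}

saved_vs_master = timings["master"] - timings["ssh_enabled"]
ssh_overhead = timings["ssh_enabled"] - timings["ssh_disabled"]

print(f"this branch saves {saved_vs_master:.2f}s over master")
print(f"waiting for SSH costs {ssh_overhead:.2f}s across 100 nodes")
```

The first difference works out to roughly four and a half minutes and the second to just under 40 seconds, so most of the gain comes from the apt update refactor rather than from disabling SSH.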
Requested by user. Adds a flag to disable SSH setup, which can take ~10s per pod.
Depending on the degree of parallelism available in an environment and the total number of pods to provision, this could add delays of anywhere between 10 seconds and n*10 seconds.
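The two bounds can be illustrated with a simple batch model. This is an assumption, not a measurement: it supposes SSH setup takes a fixed ~10s per pod and that pods are set up in waves of `parallelism` at a time. The function name `ssh_setup_delay` is hypothetical.

```python
import math


def ssh_setup_delay(num_pods: int, parallelism: int, per_pod_s: float = 10.0) -> float:
    """Estimate total SSH setup delay when pods are handled in parallel waves."""
    if num_pods < 0 or parallelism < 1:
        raise ValueError("num_pods must be >= 0 and parallelism must be >= 1")
    # Each wave sets up at most `parallelism` pods and takes `per_pod_s` seconds.
    waves = math.ceil(num_pods / parallelism)
    return waves * per_pod_s


print(ssh_setup_delay(100, parallelism=100))  # fully parallel: one wave
print(ssh_setup_delay(100, parallelism=16))   # partially parallel: several waves
print(ssh_setup_delay(100, parallelism=1))    # fully serial: n waves
```

Full parallelism gives the 10 second lower bound and fully serial setup gives the n*10 second upper bound; real environments would fall somewhere in between.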
Tested (run the relevant ones):
bash format.sh
sky launch with SkyPilot image and custom image
num_nodes: 100
experimental:
  config_overrides:
    kubernetes:
      disable_ssh: true