skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 514 forks

[k8s] Skip SSH setup for faster provisioning #4225

Open romilbhardwaj opened 3 weeks ago

romilbhardwaj commented 3 weeks ago

Even though #4158 significantly improves multi-node provisioning time on k8s by parallelizing SSH setup, large jobs (50+ nodes) can still take a long time (~10 min, depending on the degree of parallelism and the number of CPU cores) to get SSH up and running on all pods.

From a user:

is it possible to make it (SSH setup) "on demand"? for example, sky ssh host_name that sets up the ssh connection and then connects to it?

my 2 cents is that an ssh connection is usually not necessary for these long-running training jobs, or at least not at launch time, if it's mostly for user convenience. additionally, we could also ssh using tools like k9s. so it's desirable to cut down the setup time as much as possible by making this optional. this also reduces the chance of timeouts, etc.
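
To make the "on demand" idea concrete, here is a minimal sketch of one way it could work, not how SkyPilot implements it. Every name here (`LazySSHSetup`, `provision`, `skip_ssh_setup`, `setup_fn`) is hypothetical; `setup_fn` stands in for whatever per-pod SSH setup provisioning runs today.

```python
import threading
from typing import Callable, Iterable, List, Set


class LazySSHSetup:
    """Hypothetical helper: runs per-pod SSH setup only on first use."""

    def __init__(self, setup_fn: Callable[[str], None]) -> None:
        # setup_fn is an assumed stand-in for the real per-pod SSH setup step.
        self._setup_fn = setup_fn
        self._ready: Set[str] = set()
        self._lock = threading.Lock()

    def ensure(self, pod_name: str) -> bool:
        """Set up SSH for pod_name if not done yet.

        Returns True if setup ran in this call, False if it was already done.
        """
        # Holding the lock during setup keeps the sketch simple, at the cost
        # of serializing concurrent first-time setups.
        with self._lock:
            if pod_name in self._ready:
                return False
            self._setup_fn(pod_name)
            self._ready.add(pod_name)
            return True


def provision(pod_names: Iterable[str], ssh: LazySSHSetup,
              skip_ssh_setup: bool = True) -> List[str]:
    """Hypothetical provisioning step with SSH setup made optional."""
    pods = list(pod_names)
    if not skip_ssh_setup:
        # Eager path: pay the setup cost for every pod up front.
        for name in pods:
            ssh.ensure(name)
    return pods
```

With something like this, a `sky ssh host_name`-style command would call `ssh.ensure(host_name)` right before connecting, so only the pods a user actually connects to pay the setup cost.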