It also speeds up job scheduling by avoiding updating the status of all jobs:
Previously, the gap between consecutive jobs being scheduled was ~4-5s, i.e. with a 60s task duration, about 12 jobs run in parallel. The additional RUNNING jobs below (those with a duration of more than 1 minute) are invoking the scheduling step for other jobs, which spends a long time waiting for the job scheduler to finish.
221 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-10-870719
220 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-09-092916
219 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-07-104709
218 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-06-950815
217 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-06-835127
216 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-04-966880
215 sky-cmd 11 mins ago a few secs ago 1s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-30-04-499591
214 sky-cmd 11 mins ago a few secs ago 5s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-30-03-997782
213 sky-cmd 11 mins ago a few secs ago 9s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-30-03-401201
212 sky-cmd 11 mins ago 14 secs ago 14s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-30-02-683537
211 sky-cmd 11 mins ago 18 secs ago 18s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-58-437107
210 sky-cmd 11 mins ago 22 secs ago 22s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-58-121545
209 sky-cmd 11 mins ago 27 secs ago 27s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-58-030503
208 sky-cmd 11 mins ago 30 secs ago 30s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-55-595568
207 sky-cmd 11 mins ago 36 secs ago 36s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-55-308832
206 sky-cmd 11 mins ago 41 secs ago 41s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-54-752160
205 sky-cmd 11 mins ago 46 secs ago 46s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-54-840566
204 sky-cmd 11 mins ago 52 secs ago 52s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-54-647147
203 sky-cmd 11 mins ago 56 secs ago 56s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-51-826498
202 sky-cmd 11 mins ago 1 min ago 1m 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-51-525987
201 sky-cmd 11 mins ago 1 min ago 1m 4s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-51-529238
200 sky-cmd 11 mins ago 1 min ago 1m 12s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-49-392753
199 sky-cmd 11 mins ago 1 min ago 1m 16s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-48-211998
198 sky-cmd 11 mins ago 1 min ago 1m 20s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-47-018499
197 sky-cmd 11 mins ago 1 min ago 1m 24s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-46-732420
196 sky-cmd 11 mins ago 1 min ago 1m 29s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-46-527903
195 sky-cmd 11 mins ago 1 min ago 1m 34s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-44-218424
194 sky-cmd 11 mins ago 1 min ago 1m 40s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-43-986796
193 sky-cmd 11 mins ago 1 min ago 1m 44s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-44-039565
192 sky-cmd 11 mins ago 1 min ago 1m 45s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-05-19-29-42-571895
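A listing like the one above can be tallied mechanically instead of by eye. The following is a minimal sketch, assuming rows have the shape shown above (duration, then a resources token such as `1x[CPU:1+]`, then an upper-case status). The names `summarize`, `duration_seconds`, and `SAMPLE` are hypothetical helpers for illustration, not part of SkyPilot, and the sample rows are copied from the listing above.

```python
import re
from collections import Counter

# Matches "<duration> <resources> <STATUS>" inside one queue row, e.g.
# "1m 4s 1x[CPU:1+] RUNNING" or "- 1x[CPU:1+] PENDING".
ROW = re.compile(
    r"(?P<dur>-|\d+[hms](?: \d+[hms])*) "
    r"(?P<res>\d+x\[[^\]]*\]) "
    r"(?P<status>[A-Z_]+)"
)
UNIT_SECONDS = {"h": 3600, "m": 60, "s": 1}


def duration_seconds(text):
    """Convert a duration such as '1m 4s' to seconds; '-' means not started."""
    if text == "-":
        return None
    return sum(int(part[:-1]) * UNIT_SECONDS[part[-1]] for part in text.split())


def summarize(listing, task_seconds=60):
    """Count rows per status, splitting RUNNING by whether the task time has passed."""
    counts = Counter()
    for line in listing.splitlines():
        match = ROW.search(line)
        if match is None:
            continue  # Not a queue row (blank line, header, etc.).
        status = match["status"]
        seconds = duration_seconds(match["dur"])
        if status == "RUNNING" and seconds is not None and seconds > task_seconds:
            counts["RUNNING_past_task_duration"] += 1
        elif status == "RUNNING":
            counts["RUNNING_within_task_duration"] += 1
        else:
            counts[status] += 1
    return counts


# A few rows copied from the listing above.
SAMPLE = """\
216 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-19-30-04-966880
215 sky-cmd 11 mins ago a few secs ago 1s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-30-04-499591
203 sky-cmd 11 mins ago 56 secs ago 56s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-51-826498
201 sky-cmd 11 mins ago 1 min ago 1m 4s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-19-29-51-529238
192 sky-cmd 11 mins ago 1 min ago 1m 45s 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-05-19-29-42-571895
"""

if __name__ == "__main__":
    print(dict(summarize(SAMPLE)))
```

Rows counted as `RUNNING_past_task_duration` are the ones still shown as RUNNING after the 60s task time has elapsed, which is the symptom described above.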
Now, each job takes ~2-3 seconds to be submitted, so 21 jobs can run in parallel, and no job has a duration significantly longer than 1 minute.
339 sky-cmd 11 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-05-23-31-46-294901
338 sky-cmd 11 mins ago a few secs ago 2s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-46-151382
337 sky-cmd 11 mins ago a few secs ago 5s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-44-053912
336 sky-cmd 11 mins ago a few secs ago 8s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-41-961098
335 sky-cmd 11 mins ago 11 secs ago 11s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-41-919819
334 sky-cmd 11 mins ago 14 secs ago 14s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-41-611173
333 sky-cmd 11 mins ago 16 secs ago 16s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-41-550143
332 sky-cmd 11 mins ago 19 secs ago 19s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-41-136605
331 sky-cmd 11 mins ago 22 secs ago 22s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-40-268279
330 sky-cmd 11 mins ago 25 secs ago 25s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-40-298456
329 sky-cmd 11 mins ago 28 secs ago 28s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-38-036070
328 sky-cmd 11 mins ago 30 secs ago 30s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-35-625810
327 sky-cmd 11 mins ago 33 secs ago 33s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-35-368378
326 sky-cmd 11 mins ago 36 secs ago 36s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-35-547889
325 sky-cmd 11 mins ago 39 secs ago 39s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-35-271033
324 sky-cmd 11 mins ago 41 secs ago 41s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-34-871366
323 sky-cmd 11 mins ago 44 secs ago 44s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-34-202901
322 sky-cmd 11 mins ago 47 secs ago 47s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-34-022552
321 sky-cmd 11 mins ago 50 secs ago 50s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-31-843038
320 sky-cmd 11 mins ago 53 secs ago 53s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-29-105083
319 sky-cmd 11 mins ago 56 secs ago 56s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-29-240834
318 sky-cmd 11 mins ago 58 secs ago 58s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-05-23-31-28-474097
317 sky-cmd 11 mins ago 1 min ago 1m 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-05-23-31-28-665341
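The parallelism figures follow from a simple ratio: if a new job starts every `g` seconds and each job runs for `d` seconds, roughly `d / g` jobs overlap at steady state. Below is a minimal sketch of that arithmetic, using the approximate gaps quoted in this description. The function name `steady_state_parallelism` is hypothetical and not part of SkyPilot.

```python
def steady_state_parallelism(task_seconds: float, start_gap_seconds: float) -> float:
    """Approximate number of overlapping jobs when one job starts every
    start_gap_seconds and each job runs for task_seconds."""
    if start_gap_seconds <= 0:
        raise ValueError("start_gap_seconds must be positive")
    return task_seconds / start_gap_seconds


if __name__ == "__main__":
    task_seconds = 60  # Task duration used in the test above.
    for label, (fast_gap, slow_gap) in (("before", (4, 5)), ("after", (2, 3))):
        low = steady_state_parallelism(task_seconds, slow_gap)
        high = steady_state_parallelism(task_seconds, fast_gap)
        print(f"{label}: {low:.0f}-{high:.0f} parallel jobs")
    # before: 12-15 parallel jobs
    # after: 20-30 parallel jobs
```

The observed values of 12 and 21 parallel jobs fall at the edge of or within these ranges.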
Tested (run the relevant ones):
[ ] Code formatting: bash format.sh
[x] Any manual or new tests for this PR (please specify below)
[x] Reproducible script in #4263
[x] All smoke tests: pytest tests/test_smoke.py --aws
Fixes #4263
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py --aws
pytest tests/test_smoke.py::test_fill_in_the_name
pytest tests/test_smoke.py::test_multi_echo
conda deactivate; bash -i tests/backward_compatibility_tests.sh