SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
Follow-up to #4310: we now allow 2 PENDING jobs to be scheduled concurrently, which lets us reach the full 32 simultaneous jobs for 1-min jobs (> 1.4x faster).
Note: this relaxes the FIFO order slightly, i.e. at most one later job can be scheduled ahead of an earlier job.
We could increase the number of concurrent Ray job submissions further, but that would lead to:
Weaker FIFO ordering: the more Ray jobs are submitted concurrently, the more jobs may be scheduled out of FIFO order.
Higher memory consumption: every submitted Ray job consumes memory.
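The FIFO tradeoff above can be illustrated with a toy model. This is only a sketch under stated assumptions, not SkyPilot's scheduler: it assumes pending jobs are taken from the head of the queue in groups of `max_concurrent` and may start in any order within a group. The names `simulate_start_order`, `max_overtakes`, and `max_concurrent` are hypothetical and do not come from the codebase.

```python
import random


def simulate_start_order(num_jobs, max_concurrent, seed=0):
    """Return the order in which jobs start under the toy model.

    Assumption: jobs 0..num_jobs-1 are queued in FIFO order, and up to
    `max_concurrent` jobs at the head of the queue are submitted together.
    Within one group the start order is arbitrary (modelled as a shuffle).
    """
    rng = random.Random(seed)
    start_order = []
    for head in range(0, num_jobs, max_concurrent):
        group = list(range(head, min(head + max_concurrent, num_jobs)))
        rng.shuffle(group)  # any job in the group may start first
        start_order.extend(group)
    return start_order


def max_overtakes(start_order):
    """Largest number of later-queued jobs that started before any one job."""
    worst = 0
    for pos, job in enumerate(start_order):
        overtakers = sum(1 for other in start_order[:pos] if other > job)
        worst = max(worst, overtakers)
    return worst


if __name__ == "__main__":
    for k in (1, 2, 4):
        order = simulate_start_order(num_jobs=32, max_concurrent=k, seed=1)
        print(k, max_overtakes(order))
```

Under this model a job can be overtaken by at most `max_concurrent - 1` later jobs, which matches the "at most one later job" note for a limit of 2 and shows why raising the limit loosens the ordering further.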
257 sky-cmd 4 mins ago - - 1x[CPU:1+] PENDING ~/sky_logs/sky-2024-11-09-09-15-53-466777
256 sky-cmd 4 mins ago a few secs ago 8s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-52-769114
255 sky-cmd 4 mins ago a few secs ago 8s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-51-749071
254 sky-cmd 4 mins ago a few secs ago 10s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-50-819021
253 sky-cmd 4 mins ago a few secs ago 10s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-49-438118
252 sky-cmd 4 mins ago 13 secs ago 13s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-48-685777
251 sky-cmd 4 mins ago 13 secs ago 13s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-48-208294
250 sky-cmd 4 mins ago 16 secs ago 16s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-47-751779
249 sky-cmd 4 mins ago 16 secs ago 16s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-47-080846
248 sky-cmd 4 mins ago 19 secs ago 19s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-46-371702
247 sky-cmd 4 mins ago 19 secs ago 19s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-45-185725
246 sky-cmd 4 mins ago 22 secs ago 22s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-44-439899
245 sky-cmd 4 mins ago 22 secs ago 22s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-43-138175
244 sky-cmd 4 mins ago 25 secs ago 25s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-42-233891
243 sky-cmd 4 mins ago 25 secs ago 25s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-41-872538
242 sky-cmd 4 mins ago 28 secs ago 28s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-41-343198
241 sky-cmd 4 mins ago 28 secs ago 28s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-40-807990
240 sky-cmd 4 mins ago 31 secs ago 31s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-40-159020
239 sky-cmd 4 mins ago 31 secs ago 31s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-38-953233
238 sky-cmd 4 mins ago 33 secs ago 33s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-38-661534
237 sky-cmd 4 mins ago 33 secs ago 33s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-36-493768
236 sky-cmd 4 mins ago 37 secs ago 37s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-36-027661
235 sky-cmd 4 mins ago 37 secs ago 37s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-35-079209
234 sky-cmd 4 mins ago 39 secs ago 39s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-34-933620
233 sky-cmd 4 mins ago 40 secs ago 40s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-33-983345
232 sky-cmd 4 mins ago 42 secs ago 42s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-33-948978
231 sky-cmd 4 mins ago 42 secs ago 42s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-32-272653
230 sky-cmd 4 mins ago 45 secs ago 45s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-31-713130
229 sky-cmd 5 mins ago 45 secs ago 45s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-29-825658
228 sky-cmd 5 mins ago 48 secs ago 48s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-28-983373
227 sky-cmd 5 mins ago 48 secs ago 48s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-28-049120
226 sky-cmd 5 mins ago 51 secs ago 51s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-27-978491
225 sky-cmd 5 mins ago 51 secs ago 51s 1x[CPU:1+] RUNNING ~/sky_logs/sky-2024-11-09-09-15-26-983769
224 sky-cmd 5 mins ago 1 min ago 1m 1x[CPU:1+] SUCCEEDED ~/sky_logs/sky-2024-11-09-09-15-26-693400
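In the listing above, 32 jobs are RUNNING and their start times advance in pairs roughly every 3 seconds. A back-of-the-envelope check of why that is enough to fill 32 slots with 1-min jobs is sketched below. The function name, the 3-second interval read off the listing, and the assumption that a one-job-at-a-time round takes the same 3 seconds are illustrative, not measured values from the PR.

```python
def steady_state_running(jobs_per_round, round_interval_s, job_duration_s, slot_limit):
    """Estimate how many jobs run at once in steady state.

    Uses Little's law (concurrency = start rate * job duration), capped at
    the number of available slots.
    """
    started_per_lifetime = jobs_per_round * job_duration_s / round_interval_s
    return min(slot_limit, int(started_per_lifetime))


if __name__ == "__main__":
    # Two jobs per ~3 s round, 60 s jobs, 32 slots (as in the listing).
    print(steady_state_running(2, 3, 60, 32))  # 32
    # Hypothetical comparison: one job per round at the same interval.
    print(steady_state_running(1, 3, 60, 32))  # 20
```

The shorter the jobs, the more the start rate (rather than the slot count) limits utilization, which is why the gain is reported for 1-min jobs.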
Tested (run the relevant ones):
[ ] Code formatting: bash format.sh
[ ] Any manual or new tests for this PR (please specify below)
We should weigh the tradeoff between losing strict FIFO ordering and the time spent on scheduling, especially since #4318 has already significantly sped up job scheduling.
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh