skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.81k stars, 513 forks

[Core] Allow more PENDING jobs to be scheduled concurrently (1.4x faster) #4311

Open Michaelvll opened 1 week ago

Michaelvll commented 1 week ago

Follow-up to #4310: we now allow 2 PENDING jobs to be scheduled concurrently, and it can reach the full 32 simultaneous jobs for 1-min jobs (> 1.4x faster). Note: this breaks the FIFO order slightly, i.e. at most one later job can be scheduled ahead of an earlier job.

We can increase the number of concurrent ray job submissions further, but that will lead to:

  1. Weaker FIFO ordering, i.e. the more concurrent ray job submissions, the more jobs may be scheduled in non-FIFO order.
  2. Higher memory consumption, since submitted ray jobs consume memory.

257  sky-cmd  4 mins ago      -               -         1x[CPU:1+]  PENDING    ~/sky_logs/sky-2024-11-09-09-15-53-466777  
256  sky-cmd  4 mins ago      a few secs ago  8s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-52-769114  
255  sky-cmd  4 mins ago      a few secs ago  8s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-51-749071  
254  sky-cmd  4 mins ago      a few secs ago  10s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-50-819021  
253  sky-cmd  4 mins ago      a few secs ago  10s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-49-438118  
252  sky-cmd  4 mins ago      13 secs ago     13s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-48-685777  
251  sky-cmd  4 mins ago      13 secs ago     13s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-48-208294  
250  sky-cmd  4 mins ago      16 secs ago     16s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-47-751779  
249  sky-cmd  4 mins ago      16 secs ago     16s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-47-080846  
248  sky-cmd  4 mins ago      19 secs ago     19s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-46-371702  
247  sky-cmd  4 mins ago      19 secs ago     19s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-45-185725  
246  sky-cmd  4 mins ago      22 secs ago     22s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-44-439899  
245  sky-cmd  4 mins ago      22 secs ago     22s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-43-138175  
244  sky-cmd  4 mins ago      25 secs ago     25s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-42-233891  
243  sky-cmd  4 mins ago      25 secs ago     25s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-41-872538  
242  sky-cmd  4 mins ago      28 secs ago     28s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-41-343198  
241  sky-cmd  4 mins ago      28 secs ago     28s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-40-807990  
240  sky-cmd  4 mins ago      31 secs ago     31s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-40-159020  
239  sky-cmd  4 mins ago      31 secs ago     31s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-38-953233  
238  sky-cmd  4 mins ago      33 secs ago     33s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-38-661534  
237  sky-cmd  4 mins ago      33 secs ago     33s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-36-493768  
236  sky-cmd  4 mins ago      37 secs ago     37s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-36-027661  
235  sky-cmd  4 mins ago      37 secs ago     37s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-35-079209  
234  sky-cmd  4 mins ago      39 secs ago     39s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-34-933620  
233  sky-cmd  4 mins ago      40 secs ago     40s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-33-983345  
232  sky-cmd  4 mins ago      42 secs ago     42s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-33-948978  
231  sky-cmd  4 mins ago      42 secs ago     42s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-32-272653  
230  sky-cmd  4 mins ago      45 secs ago     45s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-31-713130  
229  sky-cmd  5 mins ago      45 secs ago     45s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-29-825658  
228  sky-cmd  5 mins ago      48 secs ago     48s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-28-983373  
227  sky-cmd  5 mins ago      48 secs ago     48s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-28-049120  
226  sky-cmd  5 mins ago      51 secs ago     51s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-27-978491  
225  sky-cmd  5 mins ago      51 secs ago     51s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-09-15-26-983769  
224  sky-cmd  5 mins ago      1 min ago       1m        1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-09-09-15-26-693400 
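
The trade-off described above (more concurrent submissions give faster scheduling but looser FIFO order) can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not SkyPilot's scheduler: the function name `simulate_submissions`, its parameters, and the per-job submission latencies are all hypothetical, and the model does not reproduce this PR's exact ordering guarantee or the memory cost of submitted ray jobs.

```python
import heapq
from typing import List, Sequence, Tuple


def simulate_submissions(
    submit_latencies: Sequence[float],
    max_concurrent_submissions: int,
) -> List[Tuple[float, int]]:
    """Toy model of bounded concurrent job submission (hypothetical helper).

    Jobs are taken from the queue in FIFO order (job id == queue position).
    At most `max_concurrent_submissions` jobs are being submitted at once,
    and a job "starts" when its own submission finishes.

    Returns (start_time, job_id) pairs in the order the jobs start.
    """
    if max_concurrent_submissions < 1:
        raise ValueError("max_concurrent_submissions must be >= 1")

    in_flight: List[Tuple[float, int]] = []  # min-heap of (finish_time, job_id)
    started: List[Tuple[float, int]] = []
    next_job = 0
    now = 0.0

    while next_job < len(submit_latencies) or in_flight:
        # Fill the free submission slots with the oldest waiting jobs.
        while (next_job < len(submit_latencies)
               and len(in_flight) < max_concurrent_submissions):
            finish_time = now + submit_latencies[next_job]
            heapq.heappush(in_flight, (finish_time, next_job))
            next_job += 1
        # Advance time to the next submission that completes.
        now, job_id = heapq.heappop(in_flight)
        started.append((now, job_id))

    return started


if __name__ == "__main__":
    latencies = [3.0, 1.0, 1.0, 1.0]  # made-up values; job 0 is slow to submit
    for limit in (1, 2):
        result = simulate_submissions(latencies, limit)
        order = [job_id for _, job_id in result]
        print(f"limit={limit}: start order={order}, last start at t={result[-1][0]}")
```

With these made-up latencies, a limit of 1 keeps strict FIFO order and the last job starts at t=6.0, while a limit of 2 lets later jobs start ahead of the slow job 0 and the last job starts at t=3.0.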

Tested (run the relevant ones):

Michaelvll commented 1 week ago

We should weigh the tradeoff between losing strict FIFO order and the time spent on scheduling, especially since #4318 has already significantly sped up job scheduling.