skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.81k stars 513 forks source link

[Core] Avoid job scheduling race condition #4310

Closed Michaelvll closed 1 week ago

Michaelvll commented 1 week ago

Another race condition in job scheduling besides #4264 ...

The pending jobs should be queried during the pending loop, otherwise, a same job can be submitted twice to ray job in the following condition.

  1. Two threads calling schedule_step() and both get the list of pending jobs
  2. The first thread submitted the first pending job
  3. The second thread look at the stale pending job list and found the first job not submitted yet, so it will submit the job again.

23 jobs in parallel now, a bit more than #4264

290  sky-cmd  8 mins ago      -               -         1x[CPU:1+]  PENDING    ~/sky_logs/sky-2024-11-09-04-39-25-046022  
289  sky-cmd  8 mins ago      a few secs ago  2s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-23-183791  
288  sky-cmd  8 mins ago      a few secs ago  5s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-22-702691  
287  sky-cmd  8 mins ago      a few secs ago  7s        1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-22-167590  
286  sky-cmd  8 mins ago      a few secs ago  10s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-21-982999  
285  sky-cmd  8 mins ago      12 secs ago     12s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-21-824679  
284  sky-cmd  8 mins ago      15 secs ago     15s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-20-840780  
283  sky-cmd  8 mins ago      18 secs ago     18s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-19-944765  
282  sky-cmd  8 mins ago      20 secs ago     20s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-18-955162  
281  sky-cmd  8 mins ago      23 secs ago     23s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-17-037321  
280  sky-cmd  8 mins ago      25 secs ago     25s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-16-274941  
279  sky-cmd  8 mins ago      28 secs ago     28s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-15-639855  
278  sky-cmd  8 mins ago      30 secs ago     30s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-15-495491  
277  sky-cmd  8 mins ago      33 secs ago     33s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-15-271224  
276  sky-cmd  8 mins ago      36 secs ago     36s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-14-332769  
275  sky-cmd  8 mins ago      38 secs ago     38s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-13-681088  
274  sky-cmd  8 mins ago      41 secs ago     41s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-13-161247  
273  sky-cmd  8 mins ago      44 secs ago     44s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-11-060571  
272  sky-cmd  8 mins ago      46 secs ago     46s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-10-313379  
271  sky-cmd  8 mins ago      49 secs ago     49s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-09-829889  
270  sky-cmd  8 mins ago      52 secs ago     52s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-08-895334  
269  sky-cmd  8 mins ago      54 secs ago     54s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-08-861447  
268  sky-cmd  8 mins ago      57 secs ago     57s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-08-065965  
267  sky-cmd  8 mins ago      59 secs ago     59s       1x[CPU:1+]  RUNNING    ~/sky_logs/sky-2024-11-09-04-39-07-076732  
266  sky-cmd  8 mins ago      1 min ago       1m        1x[CPU:1+]  SUCCEEDED  ~/sky_logs/sky-2024-11-09-04-39-06-624532

Tested (run the relevant ones):

cblmemo commented 1 week ago

Actually, do you think we should add job specific lock for all job-related actions?

Michaelvll commented 1 week ago

Actually, do you think we should add job specific lock for all job-related actions?

Do you have any examples where we should add the lock?

cblmemo commented 1 week ago

Actually, do you think we should add job specific lock for all job-related actions?

Do you have any examples where we should add the lock?

e.g. in this function we are getting all information of the jobs altogether, and cancel them one-by-one. Not sure if it is possible that some job information is stale, but adding lock to every place looks safer to me

https://github.com/skypilot-org/skypilot/blob/42c79e1d0a5e018e275705ada53957573f9a0181/sky/skylet/job_lib.py#L749-L762

This is the only one I can find, but not sure if I missed any place

Michaelvll commented 1 week ago

Actually, do you think we should add job specific lock for all job-related actions?

Do you have any examples where we should add the lock?

e.g. in this function we are getting all information of the jobs altogether, and cancel them one-by-one. Not sure if it is possible that some job information is stale, but adding lock to every place looks safer to me

https://github.com/skypilot-org/skypilot/blob/42c79e1d0a5e018e275705ada53957573f9a0181/sky/skylet/job_lib.py#L749-L762

This is the only one I can find, but not sure if I missed any place

Ahh, good catch! I fixed that in #4318, but did not adopt it in this one. Let me do it now.

cblmemo commented 1 week ago

Actually, do you think we should add job specific lock for all job-related actions?

Do you have any examples where we should add the lock?

e.g. in this function we are getting all information of the jobs altogether, and cancel them one-by-one. Not sure if it is possible that some job information is stale, but adding lock to every place looks safer to me https://github.com/skypilot-org/skypilot/blob/42c79e1d0a5e018e275705ada53957573f9a0181/sky/skylet/job_lib.py#L749-L762

This is the only one I can find, but not sure if I missed any place

Ahh, good catch! I fixed that in #4318, but did not adopt it in this one. Let me do it now.

Thanks! LGTM.