skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io

[Core] A potential critical race condition for job scheduling within a cluster #4133

Open Michaelvll opened 4 days ago

Michaelvll commented 4 days ago

A potential race can cause a job to end up in the FAILED state without ever being scheduled:

  1. add_job is called, and the job is added to the job table with status INIT.
  2. update_job_status is called before the job is actually added to the pending table (by _exec_code_on_head), and reaches the following line: https://github.com/skypilot-org/skypilot/blob/71a95f4bf7f1446e80bb5c24d23c1695bc4fc031/sky/skylet/job_lib.py#L593
  3. The job is then added to the pending table with its status set to PENDING. But since the pending table was checked earlier, when the job was not yet in it, status below will be None, so the job status is set to FAILED right after the job was added to the pending table (see the sketch after this list).
  4. With the job status being FAILED, the scheduler will not pick the job up and will remove it from the pending table, so the job is never scheduled.
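Here is a simplified, self-contained sketch of that interleaving. The dict and set stand in for the real SQLite job table and pending table in sky/skylet/job_lib.py, and the function bodies are illustrative assumptions, not the actual SkyPilot code:

```python
job_table = {}         # job_id -> status (stand-in for the SQLite job table)
pending_table = set()  # job_ids waiting for the scheduler

def add_job(job_id):
    job_table[job_id] = 'INIT'      # step 1: placeholder row only

def exec_code_on_head(job_id):
    pending_table.add(job_id)       # step 3: job finally becomes pending
    job_table[job_id] = 'PENDING'

# Interleave the two halves of update_job_status() by hand:
add_job(1)
in_pending = 1 in pending_table     # step 2: stale read; job not pending yet
exec_code_on_head(1)                # submitter runs in between the two reads
if not in_pending:                  # ray reports no status for the job...
    job_table[1] = 'FAILED'         # ...so the fresh PENDING is clobbered
print(job_table[1], 1 in pending_table)  # -> FAILED True: never scheduled
```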

We should add a per-job lock around the whole update_job_status() function to avoid this race.
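As a rough sketch of the fix, both the submission path and update_job_status() could take the same per-job lock, so the check-then-fail sequence can no longer interleave with the pending-table insert. The lock directory, helper name, and use of filelock below are assumptions for illustration, not the actual patch:

```python
import os

import filelock  # SkyPilot already uses filelock for similar locking

_LOCK_DIR = '/tmp/.sky/job_locks'  # hypothetical location for per-job locks

def _job_lock(job_id):
    os.makedirs(_LOCK_DIR, exist_ok=True)
    return filelock.FileLock(os.path.join(_LOCK_DIR, f'{job_id}.lock'))

def update_job_status(job_id):
    with _job_lock(job_id):
        # The pending-table check and the FAILED fallback now execute
        # atomically with respect to the submission path below.
        ...

def exec_code_on_head(job_id):
    with _job_lock(job_id):
        # Adding the job to the pending table (INIT -> PENDING) can no
        # longer slip in between the two halves of update_job_status().
        ...
```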

This issue happens sometimes when python examples/multi_echo.py is run.

cblmemo commented 19 hours ago

Hi @Michaelvll, I'm taking a look into this and was a little confused about the design purpose of the INIT state. Just a random thought: it seems like setting the status to PENDING when we add the job could resolve the problem?

cblmemo commented 19 hours ago

Also, I'm a little confused about why adding the per-job lock to the whole update_job_status() function would help the situation. IIUC, the problem here is that update_job_status is called after add_job and before _exec_code_on_head, which suggests we should have a lock protecting any function that operates on the job table between add_job and _exec_code_on_head (i.e., add the lock if the job status is INIT). Why would adding locks to update_job_status() help with this?

Michaelvll commented 18 hours ago

> Also, I'm a little confused about why adding the per-job lock to the whole update_job_status() function would help the situation. IIUC, the problem here is that update_job_status is called after add_job and before _exec_code_on_head, which suggests we should have a lock protecting any function that operates on the job table between add_job and _exec_code_on_head (i.e., add the lock if the job status is INIT). Why would adding locks to update_job_status() help with this?

When multiple jobs are submitted, there can be multiple schedule() calls running simultaneously, and we also have a periodic update running in skylet to fix any potential issues with the job queue.
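In other words, update_job_status() has several independent callers, so locking only the INIT window on the submission path would not cover them all. A rough sketch of the call pattern (periodic_refresh and the stub bodies are made up for illustration; the real periodic refresh lives in skylet's event loop):

```python
import threading
import time

def update_job_status(job_id):
    """Stub standing in for job_lib.update_job_status()."""

def schedule(job_id):
    # Caller 1: every job submission triggers its own scheduling pass.
    update_job_status(job_id)

def periodic_refresh(job_ids, interval_seconds=30):
    # Caller 2: skylet periodically re-checks every job to repair the queue.
    while True:
        for job_id in job_ids:
            update_job_status(job_id)
        time.sleep(interval_seconds)

# Several submissions plus the periodic refresh can all touch the same
# job's row at once, which is why the lock must live inside
# update_job_status() itself.
for job_id in (1, 2, 3):
    threading.Thread(target=schedule, args=(job_id,)).start()
threading.Thread(target=periodic_refresh, args=((1, 2, 3),),
                 daemon=True).start()
```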

Michaelvll commented 18 hours ago

> Hi @Michaelvll, I'm taking a look into this and was a little confused about the design purpose of the INIT state. Just a random thought: it seems like setting the status to PENDING when we add the job could resolve the problem?

The INIT state is a placeholder: a row for the job id has been created, but the job is not actually pending yet. This is a limitation of our current implementation, i.e., we have to first add a placeholder for the to-be-submitted job in the job table on the remote VM to get a job_id for it, and only then submit the job to the cluster.
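For illustration, here is a minimal sketch of that placeholder flow, using an in-memory SQLite table to mimic the job table on the remote VM. The schema and function body are assumptions, not SkyPilot's actual code:

```python
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE jobs ('
             'job_id INTEGER PRIMARY KEY AUTOINCREMENT, status TEXT)')

def add_job() -> int:
    # Insert an INIT placeholder row just to reserve a job_id; the job is
    # only later submitted to the cluster (and flipped to PENDING).
    cur = conn.execute("INSERT INTO jobs (status) VALUES ('INIT')")
    conn.commit()
    return cur.lastrowid

job_id = add_job()   # job_id is known before the job is actually submitted
print(job_id)        # -> 1; the row's status is still INIT at this point
```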