Open Michaelvll opened 4 days ago
Hi @Michaelvll, I'm taking a look into this and was a little bit confused about what is the design purpose of the INIT state. Just a random thought, seems like setting it to PENDING when we adding the job could resolve the problem?
Also I'm a little bit confused why adding the per-job lock for the whole update_job_status()
function would help the situation. IIUC the problem here is that the update_job_status
is called after the add_job
and before the _exec_code_on_head
, which suggesting we should have a lock to protect any function to operate the job table between add_job
and _exec_code_on_head
(i.e. add lock if the job status is INIT). Why does adding locks to update_job_status()
could help this?
Also I'm a little bit confused why adding the per-job lock for the whole
update_job_status()
function would help the situation. IIUC the problem here is that theupdate_job_status
is called after theadd_job
and before the_exec_code_on_head
, which suggesting we should have a lock to protect any function to operate the job table betweenadd_job
and_exec_code_on_head
(i.e. add lock if the job status is INIT). Why does adding locks toupdate_job_status()
could help this?
When multiple jobs are submitted, there can be multiple schedule() called simultaneously, and also we have a periodic update running in the skylet to fix any potential issue with the job queue.
Hi @Michaelvll, I'm taking a look into this and was a little bit confused about what is the design purpose of the INIT state. Just a random thought, seems like setting it to PENDING when we adding the job could resolve the problem?
INIT
state means a placeholder for the job id is created, though the job is not actually pending yet. This is a limitation of our current implementation, i.e., we have to add a placeholder for the job to be submitted in the job table on the remote VM first to get job_id
for that job and then submit the job to the cluster.
A potential race that can cause a job in
FAILED
state without actually being scheduled:add_job
is called, and the job is added to job table with statusINIT
.update_job_status
is called before the actual job is added to the pending table (by_exec_code_on_head
), and get to following: https://github.com/skypilot-org/skypilot/blob/71a95f4bf7f1446e80bb5c24d23c1695bc4fc031/sky/skylet/job_lib.py#L593status
below will be None, so the job status will be set to FAILED after the job is just added to the pending table.We should add per-job lock for the whole update_job_status() function to avoid this race.
This issue happens sometime when
python examples/multi_echo.py
is called.