Closed Michaelvll closed 1 week ago
Actually, do you think we should add a job-specific lock for all job-related actions?
Do you have any examples where we should add the lock?
e.g., this function fetches the information for all jobs at once and then cancels them one by one. I'm not sure whether some of the job information can go stale in between, but adding the lock everywhere looks safer to me: https://github.com/skypilot-org/skypilot/blob/42c79e1d0a5e018e275705ada53957573f9a0181/sky/skylet/job_lib.py#L749-L762
This is the only one I could find, but I'm not sure whether I missed any other place.
Ahh, good catch! I fixed that in #4318, but did not adopt it in this one. Let me do it now.
Thanks! LGTM.
Another race condition in job scheduling besides #4264 ...
The pending jobs should be queried inside the pending loop; otherwise, the same job can be submitted to Ray twice under the following condition.
Two calls to schedule_step() run concurrently, and both get the list of pending jobs.
23 jobs in parallel now, a bit more than #4264
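A minimal sketch of the idea, assuming a toy scheduler rather than the real one: `ToyScheduler`, its status strings, and `run_concurrently` are hypothetical names invented for illustration. Each iteration re-queries the next pending job while holding the lock and marks it as submitted before releasing it, so concurrent `schedule_step()` calls cannot both pick the same job from an earlier snapshot.

```python
import threading


class ToyScheduler:
    """Hypothetical stand-in for a job scheduler with a pending queue."""

    def __init__(self, job_ids):
        self._status = {job_id: 'PENDING' for job_id in job_ids}
        self._lock = threading.Lock()
        self.submitted = []  # Records every submission, in order.

    def _next_pending(self):
        """Return the first job that is still pending, or None."""
        for job_id, status in self._status.items():
            if status == 'PENDING':
                return job_id
        return None

    def schedule_step(self):
        while True:
            with self._lock:
                # Query inside the loop and under the lock, instead of
                # iterating over a list fetched once before the loop.
                job_id = self._next_pending()
                if job_id is None:
                    return
                self._status[job_id] = 'SUBMITTED'
                self.submitted.append(job_id)


def run_concurrently(num_jobs=23, num_threads=4):
    """Run several schedule_step() calls at once and return submissions."""
    scheduler = ToyScheduler(range(num_jobs))
    threads = [
        threading.Thread(target=scheduler.schedule_step)
        for _ in range(num_threads)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return scheduler.submitted
```

If the pending list were instead fetched once before the loop, two concurrent callers could each hold a copy containing the same job and submit it twice.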
Tested (run the relevant ones):
bash format.sh
pytest tests/test_smoke.py
pytest tests/test_smoke.py::test_fill_in_the_name
conda deactivate; bash -i tests/backward_compatibility_tests.sh