Open Michaelvll opened 3 weeks ago
I'm following SkyServe's implementation to apply CPU limits to `sky.launch`. At the same time, I'm also applying memory limits to the number of concurrent jobs for now. Do we have any results showing that the OOM is caused by `sky.launch` rather than by Ray job management? I remember it being the other way around, but feel free to correct me if I'm wrong ;)
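As a rough illustration of bounding concurrency by both CPU and memory, here is a minimal sketch. It is not SkyPilot's or SkyServe's actual code: the function name `max_parallel_jobs`, its parameters, and the default per-job figures are all hypothetical placeholders, and real per-job memory usage would need to be measured on the controller.

```python
import os


def max_parallel_jobs(total_cpus, total_mem_gb,
                      cpus_per_job=1.0, mem_gb_per_job=2.0,
                      reserved_mem_gb=2.0):
    """Return how many jobs may run at once under CPU and memory bounds.

    All names and default values are hypothetical, for illustration only.
    """
    if cpus_per_job <= 0 or mem_gb_per_job <= 0:
        raise ValueError("per-job resource estimates must be positive")
    # Bound from CPU: how many controller processes the cores can host.
    cpu_bound = int(total_cpus // cpus_per_job)
    # Bound from memory: keep some headroom for the rest of the system.
    usable_mem_gb = max(total_mem_gb - reserved_mem_gb, 0)
    mem_bound = int(usable_mem_gb // mem_gb_per_job)
    # Take the tighter bound, but always allow at least one job.
    return max(1, min(cpu_bound, mem_bound))


if __name__ == "__main__":
    # Assumed example: the caller supplies the controller's memory size.
    cpus = os.cpu_count() or 1
    print(max_parallel_jobs(cpus, total_mem_gb=16))
```

Taking the minimum of the two bounds means that a controller with many cores but little memory is limited by memory, which is the case a CPU-only limit misses.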
We only limit the number of parallel running managed jobs by the number of CPU cores each job controller process uses, but that is not enough because:
`sky launch`: it can still experience OOM.
A potential solution:
To reproduce:
If we run `sky jobs queue` after a while, we can see `FAILED_CONTROLLER` for some of the jobs (mainly because of the OOM).
Version & Commit info:
`sky -v`: PLEASE_FILL_IN
`sky -c`: PLEASE_FILL_IN