skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0

[Jobs] Managed job controller process taking too much memory during peak time #4243

Open Michaelvll opened 3 weeks ago

Michaelvll commented 3 weeks ago

We currently only limit the number of parallel running managed jobs based on the number of CPU cores each job controller process uses, but that is not enough because:

  1. during peak times, when multiple parallel jobs are doing sky launch, the controller can still experience OOM;
  2. a user may be using a controller instance with a memory/CPU ratio < 4.
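To see why a CPU-only limit is not enough, consider bounding parallelism by both resources. A minimal sketch, assuming illustrative per-job CPU shares and per-launch memory footprints (these numbers are not SkyPilot's actual constants):

```python
def max_parallel_launches(cpu_count: int,
                          total_memory_gb: float,
                          cpus_per_job: float = 0.25,
                          mem_per_launch_gb: float = 2.0) -> int:
    """Bound parallelism by both CPU and memory.

    The per-job CPU share and per-launch memory footprint are
    illustrative assumptions, not SkyPilot's real constants.
    """
    cpu_bound = int(cpu_count / cpus_per_job)
    mem_bound = int(total_memory_gb / mem_per_launch_gb)
    # On instances with a memory/CPU ratio < 4 GB per core, the memory
    # bound is the smaller one, so a CPU-only limit over-admits jobs.
    return max(1, min(cpu_bound, mem_bound))
```

With these assumed numbers, an 8-core / 16 GB controller (ratio 2) would admit 8 parallel launches instead of the 32 a CPU-only limit allows.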

A potential solution:

  1. Adopt resource management similar to SkyServe's to limit the number of parallel launches and guard against high memory consumption.

To reproduce:

for i in `seq 1 100`; do
  sky jobs launch --fast -n test-job-$i -yd "echo hi; sleep 240" &
done
wait

If we run sky jobs queue after a while, we can see FAILED_CONTROLLER for some of the jobs (mainly because of OOM).

Version & Commit info:

cblmemo commented 3 weeks ago

I'm following SkyServe's implementation to apply CPU limits to sky.launch, and at the same time applying memory limits to the number of concurrent jobs for now. Do we have results showing the OOM is caused by sky.launch rather than by Ray job management? I remember it being otherwise, but feel free to correct me if I'm wrong ;)
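One way to answer the attribution question empirically is to sample per-process resident memory on the controller while the repro runs and see whether the launch processes or the Ray processes grow. A rough Linux-only sketch reading /proc (not part of SkyPilot; process names like "ray" or "python" are whatever the controller actually runs):

```python
import os
import re
from collections import defaultdict

def rss_by_command():
    """Sum resident set size (kB) per process command name by reading
    /proc/<pid>/status (Linux-only). Comparing snapshots over time shows
    which command's memory grows during peak launch activity."""
    totals = defaultdict(int)
    for pid in filter(str.isdigit, os.listdir('/proc')):
        try:
            with open(f'/proc/{pid}/status') as f:
                status = f.read()
        except (FileNotFoundError, PermissionError, ProcessLookupError):
            continue  # process exited or is inaccessible
        name = re.search(r'^Name:\s+(\S+)', status, re.M)
        rss = re.search(r'^VmRSS:\s+(\d+)\s+kB', status, re.M)
        if name and rss:  # kernel threads have no VmRSS line
            totals[name.group(1)] += int(rss.group(1))
    return dict(totals)
```

Running this periodically during the 100-job repro and diffing the totals would show whether sky launch or Ray dominates the growth.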