skypilot-org / skypilot

SkyPilot: Run AI and batch jobs on any infra (Kubernetes or 12+ clouds). Get unified execution, cost savings, and high GPU availability via a simple interface.
https://skypilot.readthedocs.io
Apache License 2.0
6.82k stars 513 forks

[Jobs] Limit number of concurrent jobs & launches. #4248

Open cblmemo opened 2 weeks ago

cblmemo commented 2 weeks ago

Fixes #4243.

This PR adds a memory-based limit on the number of concurrently running jobs, and a CPU-based limit on the number of concurrent sky launch calls made by the jobs controller.

I followed SkyServe's implementation and apply the CPU limit only to concurrent launches, as IIRC sky.launch consumes more compute than memory. Likewise, the memory limit applies only to the number of concurrent jobs, as Ray jobs consume more memory.
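To illustrate the idea, here is a minimal sketch of how the two limits could be derived and enforced. It is not the code in this PR: the constants `JOB_MEMORY_GB` and `LAUNCHES_PER_CPU`, the helper functions, and the `ConcurrencyLimiter` class are all hypothetical names with assumed values, chosen only for illustration.

```python
import threading

# Assumed values for illustration only; the PR may use different numbers.
JOB_MEMORY_GB = 0.5      # hypothetical memory budget per running job
LAUNCHES_PER_CPU = 4     # hypothetical number of launches allowed per vCPU


def max_concurrent_jobs(total_memory_gb: float) -> int:
    """Memory-based cap on running jobs (always at least 1)."""
    return max(1, int(total_memory_gb / JOB_MEMORY_GB))


def max_concurrent_launches(cpu_count: int) -> int:
    """CPU-based cap on concurrent launches (always at least 1)."""
    return max(1, cpu_count * LAUNCHES_PER_CPU)


class ConcurrencyLimiter:
    """Non-blocking slot counter backed by a bounded semaphore."""

    def __init__(self, limit: int) -> None:
        self._sem = threading.BoundedSemaphore(limit)

    def try_acquire(self) -> bool:
        # Returns False immediately when every slot is taken.
        return self._sem.acquire(blocking=False)

    def release(self) -> None:
        self._sem.release()


# Example: a controller with 4 vCPUs and 8 GB of memory.
job_slots = ConcurrencyLimiter(max_concurrent_jobs(8))
launch_slots = ConcurrencyLimiter(max_concurrent_launches(4))
```

In this sketch a job would take a job slot for its whole lifetime and a launch slot only while it is launching, so a CPU-heavy launch phase does not count against the memory-based job cap once it finishes.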

Tested (run the relevant ones):