Global maxReplicaCount for ScaledJobs

reconlabs-marshall commented 2 weeks ago

Proposal

I'm designing a cluster that runs multiple kinds of ScaledJob, each consuming an AWS SQS queue. The cluster has limited resources.

Each job runs a small 'controller' app, and that app starts a resource-intensive, long-running ML workflow via Kubeflow.

The problem is that when I set maxReplicaCount on each ScaledJob, every ScaledJob scales its 'controller' jobs up to that maximum, but most of them hang because the cluster does not have enough resources to run all the workflows simultaneously.

So it would be great to be able to limit the total number of jobs that ScaledJobs create across the cluster.

Use-Case

If there were a global setting such as ScaledJobQuota that limits the total number of jobs created by ScaledJobs, it would help cap the number of jobs at the number of GPUs or other resources available.
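To make the requested behaviour concrete, here is a minimal sketch of how a shared cap could be applied across several ScaledJobs. This is not KEDA code and KEDA has no such setting today; the function name `apply_global_cap`, its parameters, and the first-come ordering policy are all assumptions made for illustration.

```python
def apply_global_cap(desired, per_job_max, global_max):
    """Clamp each ScaledJob's desired job count to its own max, then to a shared budget.

    desired:     hypothetical mapping of ScaledJob name -> jobs the scaler wants
    per_job_max: mapping of ScaledJob name -> its maxReplicaCount
    global_max:  hypothetical cluster-wide cap (e.g. the number of GPUs)
    """
    remaining = global_max
    granted = {}
    # Sorted order keeps the result deterministic; a real policy might
    # weight queues or share the budget proportionally instead.
    for name in sorted(desired):
        want = min(desired[name], per_job_max.get(name, desired[name]))
        granted[name] = max(0, min(want, remaining))
        remaining -= granted[name]
    return granted


if __name__ == "__main__":
    # Two ScaledJobs each want 5 jobs, but only 4 GPUs are available.
    print(apply_global_cap({"a": 5, "b": 5}, {"a": 3, "b": 4}, 4))
```

With a cap of 4, ScaledJob "a" is granted its own maximum of 3 and "b" gets the single remaining slot, so no 'controller' job starts without capacity behind it.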

Is this a feature you are interested in implementing yourself?

No

Anything else?

No response

zroubalik commented 3 days ago

This feature makes sense, would you be willing to contribute it?

reconlabs-marshall commented 3 days ago

> This feature makes sense, would you be willing to contribute it?

I'm not familiar with Go. I've skimmed the related code but couldn't figure out how to implement the feature.

I have worked around this issue by setting a batch Job quota on the namespace. But if someone else actively uses Jobs in that namespace, this may not be an option.
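For anyone wanting to try the same workaround, the sketch below builds a Kubernetes ResourceQuota manifest that caps the number of batch Jobs in a namespace using the object-count key `count/jobs.batch`. The namespace `ml-workflows`, the quota name `scaledjob-job-cap`, and the limit of 4 are placeholder assumptions, not values from my setup.

```python
import json


def job_count_quota(namespace, max_jobs, name="scaledjob-job-cap"):
    """Return a ResourceQuota manifest limiting how many Job objects may exist."""
    return {
        "apiVersion": "v1",
        "kind": "ResourceQuota",
        "metadata": {"name": name, "namespace": namespace},
        # Object-count quota: applies to every Job in the namespace,
        # not only those created by ScaledJobs.
        "spec": {"hard": {"count/jobs.batch": str(max_jobs)}},
    }


if __name__ == "__main__":
    # Placeholder values; adjust to the namespace and GPU count in use.
    print(json.dumps(job_count_quota("ml-workflows", 4), indent=2))
```

The printed JSON can be piped to `kubectl apply -f -`. Because the quota counts all Jobs in the namespace, it only works cleanly when the ScaledJobs have a namespace to themselves.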

kedacore / keda