Open reconlabs-marshall opened 2 weeks ago
This feature makes sense, would you be willing to contribute it?
I'm not familiar with Go. I've skimmed the related code but couldn't figure out how to implement the feature.
I have worked around this issue by setting a batch Job quota on the namespace. However, if someone else actively uses Jobs in that namespace, this may not be an option.
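For anyone looking for the same workaround, a minimal sketch of a namespace-level Job quota is shown below. It uses the standard Kubernetes object-count quota key `count/jobs.batch`. The quota name, the namespace, and the limit of 4 are hypothetical placeholders, not values taken from this issue.

```yaml
# Sketch only: caps the total number of Job objects in one namespace.
# "job-quota", "ml-workflows", and the limit "4" are assumed placeholders.
apiVersion: v1
kind: ResourceQuota
metadata:
  name: job-quota
  namespace: ml-workflows
spec:
  hard:
    # Object-count quota for the Job resource in the batch API group.
    count/jobs.batch: "4"
```

The quota counts every Job object in the namespace, not only those created by ScaledJobs, which is why it stops being an option when other workloads in the same namespace rely on Jobs.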
Proposal
I'm designing a cluster that runs multiple kinds of ScaledJob, each consuming an AWS SQS queue. The cluster has limited resources.
The job runs a small 'controller' app, and that app initiates a resource-consuming, long-running ML workflow via Kubeflow.
The problem is that when I set maxReplicaCount on each ScaledJob, each one scales its 'controller' jobs up to the maximum count, but most of them hang because the cluster does not have enough resources to run them all simultaneously.
So it would be great to be able to limit the overall number of ScaledJob jobs in the cluster.
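For context, a ScaledJob of the kind described above might look like the following sketch. The resource name, container image, queue URL, region, and numeric values are hypothetical placeholders, and authentication settings are omitted. Each such ScaledJob enforces its own maxReplicaCount independently, so several of them together can request more than the cluster can run.

```yaml
# Sketch only: one of several ScaledJobs, each with its own per-object cap.
# All names, URLs, and numbers below are assumed placeholders.
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: workflow-controller-a
spec:
  jobTargetRef:
    template:
      spec:
        containers:
          - name: controller
            image: example.com/workflow-controller:latest
        restartPolicy: Never
  # Upper bound for this ScaledJob only; there is no cluster-wide total.
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/queue-a
        queueLength: "1"
        awsRegion: us-east-1
```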
Use-Case
If there were a global setting such as ScaledJobQuota to limit the maximum number of ScaledJobs running at once, it would be possible to limit the number of jobs to the number of GPUs or other resources available.
Is this a feature you are interested in implementing yourself?
No
Anything else?
No response