volcano-sh / volcano

A Cloud Native Batch System (Project under CNCF)
https://volcano.sh
Apache License 2.0

Scale task replicas of Volcano Jobs #2639

Open vinaydhegde opened 1 year ago

vinaydhegde commented 1 year ago

What would you like to be added:

I would like to add a scaling feature to the CRD jobs.batch.volcano.sh. Running 'kubectl scale --replicas= jobs.batch.volcano.sh ' should scale the replicas of the worker task.

Why is this needed:

This is needed to scale jobs (scale the number of Pods up or down) based on CPU load.

What I tried so far

I added a subresources block to the CRD jobs.batch.volcano.sh (via kubectl edit customresourcedefinition jobs.batch.volcano.sh), but running 'kubectl scale' has no effect: it reports that the resource is scaled, but the replicas are not updated. Below is the block I added:

subresources:
  scale:
    specReplicasPath: .spec.tasks[1].replicas
    statusReplicasPath: .status.tasks[1].replicas
  status: {}

The status block of this CRD does not have a replicas field. Since statusReplicasPath cannot be left empty, I set it to .status.tasks[1].replicas.

FYI: I followed the Kubernetes documentation when trying this scaling option.

wangyang0616 commented 1 year ago

This is a really nice feature. Is there anything we can work on together?

wangyang0616 commented 1 year ago

Volcano currently does not support scaling task replicas up or down through the scale subresource.

A single Volcano job contains multiple tasks, and the specReplicasPath in the CRD cannot be used to select which task's replicas should be scaled. In addition, the task status is a map keyed by task name, so statusReplicasPath cannot be configured in the CRD either.

For job-level scaling, the Volcano job does not have a replicas field, but it does have minAvailable, which is the minimum number of replicas that must be satisfied for the job. This field works with the scale subresource, using the following configuration:

subresources:
  scale:
    specReplicasPath: .spec.minAvailable
    statusReplicasPath: .status.running

After verifying with kubectl scale --replicas=8 vcjob/job-xxx, I found that minAvailable can be modified this way, but scaling minAvailable up and down seems to have little practical value.
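The contrast between the two scale configurations in this thread can be sketched in Python. This is only an illustration, assuming the rule from the Kubernetes CRD documentation that scale subresource paths must use dot notation without array indexing. The helper `resolve_scale_path`, the sample job, and the task names `master` and `worker` are hypothetical and are not Kubernetes or Volcano code.

```python
import re

# Hypothetical helper, not Kubernetes code: it only mimics the documented rule
# that scale subresource paths use dot notation without array indexing.
_SEGMENT = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def resolve_scale_path(obj, path):
    """Return the value at a dot-notation path such as '.spec.minAvailable'."""
    if not path.startswith("."):
        raise ValueError(f"path must start with '.': {path!r}")
    current = obj
    for segment in path[1:].split("."):
        if not _SEGMENT.match(segment):
            # Segments such as 'tasks[1]' contain array notation and are rejected.
            raise ValueError(f"unsupported segment {segment!r} in {path!r}")
        current = current[segment]
    return current


# Hypothetical, heavily trimmed Volcano job object for illustration only.
job = {
    "spec": {
        "minAvailable": 3,
        "tasks": [
            {"name": "master", "replicas": 1},
            {"name": "worker", "replicas": 4},
        ],
    },
}

# A plain dotted path resolves to a single integer field.
print(resolve_scale_path(job, ".spec.minAvailable"))

# A path that indexes into the tasks list is not expressible in dot notation.
try:
    resolve_scale_path(job, ".spec.tasks[1].replicas")
except ValueError as err:
    print("rejected:", err)
```

The sketch does not model how the API server actually reacts to an unsupported path; it only shows why a per-task replicas field has no single dotted path that a job-wide CRD setting could point to.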

The Volcano job itself already supports elastic scaling of replicas. It combines each task's replicas and minAvailable with the job's minAvailable to achieve this. Once the minAvailable resource requirements of the job and its tasks are met, the job can run. If the cluster has more resources available, they continue to be allocated to the job until the configured number of replicas is reached.
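The elastic behavior described above can be modeled with a short Python sketch. It is a deliberately simplified model, not the Volcano scheduler: it assumes a job with a single task, one Pod per free slot, and a hypothetical function `pods_to_run` whose parameter names are made up for illustration.

```python
def pods_to_run(job_min_available, task_replicas, free_slots):
    """Simplified model of elastic allocation (hypothetical, not Volcano code).

    The job starts only when at least job_min_available Pods fit, and then
    grows with the free capacity up to task_replicas.
    """
    if free_slots < job_min_available:
        # Minimum not met: the job does not start.
        return 0
    # Minimum met: use spare capacity, capped at the desired replica count.
    return min(task_replicas, free_slots)


# A job with minAvailable=3 and 8 replicas under different amounts of free capacity.
for free in (2, 3, 5, 10):
    running = pods_to_run(job_min_available=3, task_replicas=8, free_slots=free)
    print(f"free slots={free} -> running Pods={running}")
```

In this model the number of running Pods floats between minAvailable and replicas as capacity changes, which is why changing minAvailable alone through the scale subresource does not act like a conventional replica count.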

By the way, in what scenarios is elastic scaling of a task's replica count mainly used? Could you share the details?

stale[bot] commented 1 year ago

Hello 👋 Looks like there was no activity on this issue for last 90 days. Do you mind updating us on the status? Is this still reproducible or needed? If yes, just comment on this PR or push a commit. Thanks! 🤗 If there will be no activity for 60 days, this issue will be closed (we can always reopen an issue if we need!).

stale[bot] commented 1 year ago

Closing for now as there was no activity for last 60 days after marked as stale, let us know if you need this to be reopened! 🤗

lowang-bh commented 1 year ago

/reopen

volcano-sh-bot commented 1 year ago

@lowang-bh: Reopened this issue.

sea-wyq commented 1 year ago

Currently, can a Volcano job dynamically increase or decrease the number of task Pods based on requested resources and free cluster resources?