kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
420 stars 211 forks source link

Respect SchedulingPolicy #520

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

Signed-off-by: Yuki Iwai yuki.iwai.tz@gmail.com

I changed the logic for the gang-scheduling so that the mpi-operator respects SchedulingPolicy when creating PodGroup. Mainly, I modified the following:

  1. When setting a priorityClass to PodGroup, "SchedulingPolicy.PriorityClass" is preferred over "ReplicaSpecs.PodTemplateSpec".
  2. When setting a queueName to PodGroup, "SchedulingPolicy.Queue" is preferred over annotation "scheduling.volcano.sh/queue-name".

~~3. Set "PodGroupSpec.MinResources". a. iff "SchedulingPolicy.MinAvailable" isn't empty, propagate that to PodGroup. b. In the case of PodGroupSpec.MinMember < MPIJobSpec.MPIReplicaSpecs[Worker].Replicas + 1, sort in descending order "MPIJobSpec.MPIReplicaSpecs" according to priorityClass, and then add container resources to "PodGroupSpec.MinResources". However, the total value of "MPIJobSpec.MPIReplicaSpec.Replicas" to be added must not exceed "PodGroupSpec.MinMember".~~

Fixes: #518

/assign @alculquicondor

alculquicondor commented 1 year ago

Can you summarize the changes for somebody that isn't familiar with volcano? :sweat_smile:

tenzen-y commented 1 year ago

Can you summarize the changes for somebody that isn't familiar with volcano? 😅

Sure. I will send a ping to you once this PR description is ready.

tenzen-y commented 1 year ago

@alculquicondor Updated PR description.

tenzen-y commented 1 year ago

@alculquicondor I have addressed your comments. Please take another look.

tenzen-y commented 1 year ago

@alculquicondor Updated. PTAL.

tenzen-y commented 1 year ago

@alculquicondor I addressed your suggestions and squashed commits into one.

tenzen-y commented 1 year ago

Also, I will add docs for the schedulingPolicy to https://www.kubeflow.org/.

alculquicondor commented 1 year ago

/lgtm

tenzen-y commented 1 year ago

@alculquicondor Created https://github.com/kubeflow/website/pull/3453.

tenzen-y commented 1 year ago

@alculquicondor Do we have any blocking for merging?

alculquicondor commented 1 year ago

oops, no, I just forgot to approve /approve

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [alculquicondor] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment