kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0
430 stars 216 forks source link

Update podgroups once schedulingPolicy of MPIJobs are changed #542

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

Previously, we implemented the logic to create and delete podGroups for the scheduler plugins in #538.

However, I forgot to implement the logic to update podGroups.

So, I implemented the logic. Also, I cleaned up the PodGroupCtrl interface so that we can control PodGroups without passing an unneeded namespace.

tenzen-y commented 1 year ago

/assign @alculquicondor

alculquicondor commented 1 year ago

do we have the same behavior in training-operator?

Question about coscheduling plugin: I suppose that if the minMembers increases, but pods were already scheduled, they wouldn't be disrupted. But new pods will be blocked from scheduling until the new minMembers threshold is reached. Correct?

tenzen-y commented 1 year ago

do we have the same behavior in training-operator?

Yes, the training operator updates podGroups once schedulingPolicy is updated.

https://github.com/kubeflow/common/blob/83259a047fc47895dab1012728fb829523363a6d/pkg/controller.v1/common/scheduling.go#L49-L51

But new pods will be blocked from scheduling until the new minMembers threshold is reached. Correct?

Do new pods mean the case of increased worker replicas of MPIJob (scale out)?

alculquicondor commented 1 year ago

Do new pods mean the case of increased worker replicas of MPIJob (scale out)? yes

alculquicondor commented 1 year ago

/approve

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alculquicondor

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/mpi-operator/blob/master/OWNERS)~~ [alculquicondor] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
tenzen-y commented 1 year ago

Do new pods mean the case of increased worker replicas of MPIJob (scale out)? yes

Correct.

alculquicondor commented 1 year ago

did you forget to push?

tenzen-y commented 1 year ago

did you forget to push?

@alculquicondor Sorry. I forgot to push.