kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0
51 stars 73 forks source link

Make scheduler-plugins the default gang scheduler #209

Closed Syulin7 closed 1 year ago

Syulin7 commented 1 year ago

What this PR does / why we need it:

Now supports many gang schedulers(volcano, scheduler-plugins). This PR supports koordinator gang scheduler.

Related: https://github.com/kubeflow/training-operator/issues/1746

johnugeorge commented 1 year ago

Thanks @Syulin7 We are in the process of training operator release(currently rc0). Since kubeflow/common dependency has not branched out, we have to wait a bit to have the final release and then merge this one.

/cc @tenzen-y Can you review this?

tenzen-y commented 1 year ago

Thanks @Syulin7 We are in the process of training operator release(currently rc0). Since kubeflow/common dependency has not branched out, we have to wait a bit to have the final release and then merge this one.

/cc @tenzen-y Can you review this?

Sure. First, I will ask questions about the Koordinator at https://github.com/kubeflow/training-operator/issues/1746.

tenzen-y commented 1 year ago

/hold

Syulin7 commented 1 year ago

@tenzen-y This PR is ready for review. PTAL.

tenzen-y commented 1 year ago

Can you squash commits into one?

Syulin7 commented 1 year ago

Can you squash commits into one?

Done

johnugeorge commented 1 year ago

@Syulin7 We will get back to this in a week. Sorry for the delay as we haven't released 1.6 release yet.

alculquicondor commented 1 year ago

IMO, this is not sustainable. We'll keep adding support for more schedulers in the ecosystem?

schedulers need to look into generic approaches, such as the suspend field.

tenzen-y commented 1 year ago

IMO, this is not sustainable. We'll keep adding support for more schedulers in the ecosystem?

schedulers need to look into generic approaches, such as the suspend field.

@alculquicondor We don't have any criteria to support more schedulers. So we may need to have the criteria (e.g., only CNCF projects?).

alculquicondor commented 1 year ago

An alternative that I'm proposing in kubeflow/mpi-operator#500 is that koordinator implements this on a controller in their repo.

I don't see why that's not possible.

tenzen-y commented 1 year ago

An alternative that I'm proposing in kubeflow/mpi-operator#500 is that koordinator implements this on a controller in their repo.

I don't see why that's not possible.

It looks like you have a point. However, this issue applies to volcano and scheduler-plugins as well, not just to the koordinator.

alculquicondor commented 1 year ago

Yes, it applies to all.

If the maintainers of kubeflow agree, we would have to:

  1. Open issues in volcano and scheduling-plugins to implement the logic in their side.
  2. Make it possible to disable the PodGroup control in kubeflow (I believe this is possible by just leaving the gang-scheduler parameter empty).
  3. Remove the code after a few releases.
alculquicondor commented 1 year ago

If this works out well, we could justify:

  1. Adding similar fields to kubeflow's schedulingPolicy in batch/v1.Job
  2. https://github.com/kubernetes/kubernetes/issues/107294, which all schedulers can use to generically group pods from a custom job.
tenzen-y commented 1 year ago

If this works out well, we could justify:

  1. Adding similar fields to kubeflow's schedulingPolicy in batch/v1.Job
  2. Add suspend subresource kubernetes/kubernetes#107294, which all schedulers can use to generically group pods from a custom job.

Totally, I agree. However, since we may need long terms to complete all steps, I would like to add coscheduling support to MPIJob for the short term.

tenzen-y commented 1 year ago

Yes, it applies to all.

If the maintainers of kubeflow agree, we would have to:

  1. Open issues in volcano and scheduling-plugins to implement the logic in their side.
  2. Make it possible to disable the PodGroup control in kubeflow (I believe this is possible by just leaving the gang-scheduler parameter empty).
  3. Remove the code after a few releases.

Ah, I missed the above comments. I'm ok with having the logic for PodGroup in the scheduler-plugins repo. If the scheduler-plugins team agrees with this, I can help to implement the logic on their repo.

alculquicondor commented 1 year ago

For the time being, I would vote for only keeping support for the schedulers we already have (scheduler-plugins and volcano).

I would encourage the maintainers of Koordinator to put their efforts in https://github.com/kubernetes/kubernetes/issues/107294 and https://github.com/kubernetes/enhancements/issues/3370 instead of adding more dependencies in kubeflow.

Syulin7 commented 1 year ago

@alculquicondor I understand your concerns, and it make sense. As I mentioned in https://github.com/kubeflow/training-operator/issues/1746, the Koordinator scheduler uses the PodGroup defined in scheduler-plugins/coscheduling. Therefore, supporting Koordinator in the training-operator will not add more dependencies, but rather provide another option. Currently, some people are using both the training-operator and Koordinator simultaneously, which is very useful for them.

In the long run, I think it's necessary to provide a Native PodGroup API that includes fields such as suspend, queue, minNumOfWorkers, etc. Different schedulers can implement the same Native PodGroup API. For example, the operator only needs to add a specific label to the pod, and the scheduler creates and maintains the PodGroup. This way, the operator and scheduler won't depend on each other. I think we can work towards this direction in the future, but supporting Koordinator in the training-operator currently will not affect this.

@alculquicondor @tenzen-y @terrytangyuan WDYT

tenzen-y commented 1 year ago

Therefore, supporting Koordinator in the training-operator will not add more dependencies, but rather provide another option.

That's right for now. However, the training-operator must directly depend on the koordinator if the koordinator-scheduler has original logic or CRDs in the future.

However, fortunately, the users can deploy CustomJobs with koordinator-scheduler by specifying scheduler-name even if we don't add this feature to the master branch as I mentioned in https://github.com/kubeflow/training-operator/issues/1746#issuecomment-1408364031.

So I agree with @alculquicondor.

Syulin7 commented 1 year ago

@tenzen-y @alculquicondor This PR is ready for review. PTAL.

I set SchedulerPluginsControl as the default implementation. Removed koordinator related code, set --gang-scheduler-name=koord-scheduler can support Koordinator Gang Scheduler.

The test results are as follows.

$ kubectl get pods -n kubeflow
NAME                    READY   STATUS    RESTARTS   AGE
tfjob-simple-worker-0   0/1     Pending   0          82s
tfjob-simple-worker-1   0/1     Pending   0          82s
$ kubectl describe pods -n kubeflow tfjob-simple-worker-0 | grep -A10  Events
Events:
  Type     Reason            Age   From             Message
  ----     ------            ----  ----             -------
  Warning  FailedScheduling  16m   koord-scheduler  running PreFilter plugin "Coscheduling": gang child pod not collect enough, gangName: kubeflow/tfjob-simple, podName: kubeflow/tfjob-simple-worker-0
  Warning  FailedScheduling  16m   koord-scheduler  running PreFilter plugin "Coscheduling": gang child pod not collect enough, gangName: kubeflow/tfjob-simple, podName: kubeflow/tfjob-simple-worker-0
Syulin7 commented 1 year ago

I resolved all conversations, PTAL, thanks. @tenzen-y @alculquicondor

alculquicondor commented 1 year ago

/retitle Make scheduler-plugins the default gang scheduler

Syulin7 commented 1 year ago

@alculquicondor @tenzen-y @terrytangyuan all conversations are done, PTAL.

tenzen-y commented 1 year ago

I left a comment for nit. Overall, lgtm.

tenzen-y commented 1 year ago

@Syulin7 Thanks for the great work! /lgtm

@johnugeorge Do you decide on the week for releasing the training operator v1.6, although releasing kubeflow v1.7 seems delayed?

Week 23 - Distribution Testing Ends: Wednesday, Mar 22nd 2023

alculquicondor commented 1 year ago

/lgtm

google-oss-prow[bot] commented 1 year ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: terrytangyuan

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/common/blob/master/OWNERS)~~ [terrytangyuan] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
johnugeorge commented 1 year ago

@tenzen-y The final release is not yet made due to delay in Kubeflow upstream. Sorry for the delay. This should not happen again in the future as we always branch out for all repos except common. We will target this week end to merge- Friday 24 March

tenzen-y commented 1 year ago

@tenzen-y The final release is not yet made due to delay in Kubeflow upstream. Sorry for the delay. This should not happen again in the future as we always branch out for all repos except common. We will target this week end to merge- Friday 24 March

Thanks for letting me know! I appreciate your hard work!

johnugeorge commented 1 year ago

Training operator release is made. This can be merged.

Thanks for the patience

tenzen-y commented 1 year ago

@Syulin7 Thank you! /hold cancel

tenzen-y commented 1 year ago

@johnugeorge Can we cut a new release on this repo to move forward https://github.com/kubeflow/training-operator/pull/1747?