kubeflow / common

Common APIs and libraries shared by other Kubeflow operator repositories.
Apache License 2.0

Add job suspend semantics #196

Open xiaoxubeii opened 2 years ago

xiaoxubeii commented 2 years ago

This PR adds support for job suspend semantics similar to those of the Kubernetes batch Job: https://kubernetes.io/docs/concepts/workloads/controllers/job/#suspending-a-job

google-oss-prow[bot] commented 2 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
To complete the pull request process, please assign gaocegege after the PR has been reviewed. You can assign the PR to them by writing `/assign @gaocegege` in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/common/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

gaocegege commented 2 years ago

/ok-to-test

gaocegege commented 2 years ago

Thanks for the PR, is it ready to review?

xiaoxubeii commented 2 years ago

> Thanks for the PR, is it ready to review?

@gaocegege Ready for review. Thanks :)

ggaaooppeenngg commented 1 year ago

How is this PR going now?

alculquicondor commented 1 year ago

Is this actively being worked on? Or will we get rid of the common repo first?

tenzen-y commented 1 year ago

> Is this actively being worked on? Or will we get rid of the common repo first?

@alculquicondor We will probably work on the Job suspend feature in the next Kubeflow release cycle (maybe Kubeflow v1.8?), since we didn't add this feature to the enhancement list for the upcoming release (v1.7) and the v1.7 feature freeze is coming up.

https://github.com/kubeflow/training-operator/issues/1683

Wed Jan 25th 2023 Week 18 Release Team Feature Freeze

https://github.com/kubeflow/community/blob/6ba2e0e754166989d2f0d06aae827ceafdb65b29/releases/release-1.7/README.md

johnugeorge commented 1 year ago

Agreed. We will take this up in the next release, after merging kubeflow/common as planned in https://github.com/kubeflow/training-operator/issues/1714#issuecomment-1374537434

alculquicondor commented 1 year ago

@tenzen-y how do you feel about starting with the integration for mpi-operator v2 and follow through with training-operator later? It might give us a better chance to iterate faster and learn.

tenzen-y commented 1 year ago

> @tenzen-y how do you feel about starting with the integration for mpi-operator v2 and follow through with training-operator later? It might give us a better chance to iterate faster and learn.

@alculquicondor Yes, that is a good idea; I was thinking the same. However, we need to make progress on https://github.com/kubernetes-sigs/kueue/issues/369 before we adapt mpi-operator to Kueue.

alculquicondor commented 1 year ago

Excellent! We can work on the kueue side in parallel, while we add support for suspend in the mpi-operator.

tenzen-y commented 1 year ago

> Excellent! We can work on the kueue side in parallel, while we add support for suspend in the mpi-operator.

You are right. I will work on the following steps after the Kubeflow feature freeze date (1/25), since I don't have enough bandwidth for mpi-operator v2 right now:

  1. https://github.com/kubeflow/mpi-operator/pull/502
  2. https://github.com/kubeflow/mpi-operator/issues/500
  3. Support suspend in mpi-operator

That said, anyone else can take step 3 once step 1 is completed.

alculquicondor commented 1 year ago

@mimowo will help with suspend in mpi-operator https://github.com/kubeflow/mpi-operator/issues/504

tenzen-y commented 1 year ago

Great! Thanks to @mimowo!

xiaoxubeii commented 1 year ago

> Is this actively being worked on? Or will we get rid of the common repo first?
>
> @alculquicondor Maybe, we will work on the Job suspend feature in the next kubeflow release cycle (maybe kubeflow v1.8?). Since we didn't push this feature to the enhancement lists for the next kubeflow release (v1.7) and the feature freeze for the next kubeflow version (v1.7) is coming up.
>
> kubeflow/training-operator#1683
>
> Wed Jan 25th 2023 Week 18 Release Team Feature Freeze
>
> https://github.com/kubeflow/community/blob/6ba2e0e754166989d2f0d06aae827ceafdb65b29/releases/release-1.7/README.md

Agreed. We could try to work on the Job suspend feature for Kubeflow v1.8.

alculquicondor commented 1 year ago

@johnugeorge how are we doing with the branch creation? Can we proceed with this PR or move it to training-operator?