kubeflow / mpi-operator

Kubernetes Operator for MPI-based applications (distributed training, HPC, etc.)
https://www.kubeflow.org/docs/components/training/mpi/
Apache License 2.0

Release 0.4.0 #507

Closed tenzen-y closed 1 year ago

tenzen-y commented 1 year ago

Maybe we want to cut a new mpi-operator release once we have completed the following tasks:

tenzen-y commented 1 year ago

/cc @terrytangyuan @alculquicondor @gaocegege @zw0610

terrytangyuan commented 1 year ago

Sounds good to me

ByronHsu commented 1 year ago

@tenzen-y @terrytangyuan Wondering what is the estimated release date for this task? Our company depends on mpi-operator v2. I can also help on a few if needed :)

terrytangyuan commented 1 year ago

@tenzen-y @alculquicondor Any estimates on those pending issues? Perhaps @ByronHsu could help with some of those.

tenzen-y commented 1 year ago

@ByronHsu We have yet to set a release date for 0.4.0. However, progress has been good.

> I can also help on a few if needed

Thanks.

#500 and #518 are almost complete (https://github.com/tenzen-y/mpi-operator/tree/support-scheduler-plugins).

Also, we cannot work on #505 yet since that issue depends on https://github.com/kubernetes-sigs/kueue/issues/360.

However, I'm open to other tasks not mentioned above!

alculquicondor commented 1 year ago

we can leave #505 to the kueue repo as well

tenzen-y commented 1 year ago

As another option, we might be able to include the Kueue-related enhancements after the 0.4.0 release (0.5.0?).

ByronHsu commented 1 year ago

Sounds good! Thanks for the amazing effort!

tenzen-y commented 1 year ago

It would be better to include #521 in MPI Operator v0.4.0.

mimowo commented 1 year ago

Releasing 0.4.0 will help with the Kueue-MPI integration: https://github.com/kubernetes-sigs/kueue/issues/65. Since we decided that the integration will live inside Kueue, Kueue needs a dependency on the mpi-operator. For now, I have drafted the integration (https://github.com/kubernetes-sigs/kueue/pull/578) against the master branch of the mpi-operator, so this is not blocking progress, but at some point we need to switch to a tagged release.

cc @alculquicondor @mwielgus

alculquicondor commented 1 year ago

We are pretty much ready for a release.

@terrytangyuan how can we do a release? I remember we had to upload images, but now I think that's not necessary. Although tags might still be necessary. What else do we need?

alculquicondor commented 1 year ago

Ah, this also needs to be updated https://github.com/kubeflow/mpi-operator/blob/master/RELEASE.md

@tenzen-y could you take it?

tenzen-y commented 1 year ago

> We are pretty much ready for a release.
>
> @terrytangyuan how can we do a release? I remember we had to upload images, but now I think that's not necessary. Although tags might still be necessary. What else do we need?

@alculquicondor We also need to add e2e tests for the coscheduling plugins (#500) before releasing v0.4.0, so I will update the changelog once the e2e tests are implemented.

terrytangyuan commented 1 year ago

We should release through GitHub Releases (in the UI). Yes, please update the release notes.

tenzen-y commented 1 year ago

Note that before we cut a new release, we probably need to either create CI pipelines that build the example images, or build those images manually on a local machine and push them to the registry.

terrytangyuan commented 1 year ago

Yep those should be automated. Here's a reference GitHub Action that we can borrow, e.g. docker image push and GitHub release. https://github.com/argoproj/argo-workflows/blob/master/.github/workflows/release.yaml

tenzen-y commented 1 year ago

Created an issue: #541

alculquicondor commented 1 year ago

Can we manually create the images for this release?

Are we missing anything else for the release?

tenzen-y commented 1 year ago

> Can we manually create the images for this release?

I don't have permission to publish images to Docker Hub, although I can build them locally.

> Are we missing anything else for the release?

I'm working on fixing the bug below:

> Oh, this is a bug...
> I will create a separate PR to fix that.

W0403 20:47:56.968863   15661 podgroup.go:314] Ignore replica "Launcher" priority class "non-existence": priorityclass.scheduling.k8s.io "non-existence" not found
    podgroup_test.go:624: Unexpected calculatePGMinResources for the scheduler-plugins (-want,+got):
          &v1.ResourceList{
        -   s"cpu":    {i: resource.int64Amount{value: 7}, s: "7", Format: "DecimalSI"},
        +   s"cpu":    {i: resource.int64Amount{value: 12}, Format: "DecimalSI"},
        -   s"memory": {i: resource.int64Amount{value: 19327352832}, s: "18Gi", Format: "BinarySI"},
        +   s"memory": {i: resource.int64Amount{value: 36507222016}, Format: "BinarySI"},
          }
https://github.com/kubeflow/mpi-operator/actions/runs/4601155665/jobs/8128664833?pr=540#step:8:208

https://github.com/kubeflow/mpi-operator/pull/540#issuecomment-1496012813

Also, we might need to create a CHANGELOG, as you mentioned.

alculquicondor commented 1 year ago

I do have permissions. Once you give me the green light, I could build and upload.

tenzen-y commented 1 year ago

> I do have permissions. Once you give me the green light, I could build and upload.

Great!

Note that to support multiple architectures, we must specify the platforms when we build the operator image:

$ make images PLATFORMS=linux/amd64,linux/arm64,linux/ppc64le

alculquicondor commented 1 year ago

Also need to run with IMG_BUILDER="docker buildx". However, the base images need some versioning. I'll work on this tomorrow.
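
Putting the two notes above together, the release build might look roughly like the sketch below. It assumes the command is run from the root of an mpi-operator checkout, that `docker buildx` is installed, and that the Makefile accepts `PLATFORMS` and `IMG_BUILDER` as make variables, as described in this thread. The `build_cmd` helper is hypothetical and only prints the command so it can be reviewed before running.

```shell
#!/bin/sh
# Hypothetical helper: assemble the multi-arch build command discussed above.
# PLATFORMS and IMG_BUILDER are the variable names quoted in this thread;
# nothing here is taken from the repository's Makefile itself.
PLATFORMS="${PLATFORMS:-linux/amd64,linux/arm64,linux/ppc64le}"
IMG_BUILDER="${IMG_BUILDER:-docker buildx}"

build_cmd() {
  # Print the command instead of running it, so it can be inspected first.
  printf 'make images PLATFORMS=%s IMG_BUILDER="%s"\n' "$PLATFORMS" "$IMG_BUILDER"
}

build_cmd
```

To build for real, run the printed command from the repository root. Overriding `PLATFORMS` in the environment narrows the build to fewer architectures.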

tenzen-y commented 1 year ago

Released v0.4.0 🎉

https://github.com/kubeflow/mpi-operator/releases/tag/v0.4.0

Only https://github.com/kubeflow/website/pull/3453 remains.

tenzen-y commented 1 year ago

All tasks are completed! Thanks to everyone!

/close

google-oss-prow[bot] commented 1 year ago

@tenzen-y: Closing this issue.

In response to [this](https://github.com/kubeflow/mpi-operator/issues/507#issuecomment-1499640173):

> All tasks are completed!
> Thanks to everyone!
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.