Support Kubernetes v1.29 - v1.31 or v1.28 - v1.31

tenzen-y commented 1 month ago

What you would like to be added?

I would like to support Kubernetes v1.29 - v1.31 and drop support for v1.27 and v1.28 before we release the final training-operator v1 version.

However, given the v1.28 deprecation date, we may want to support four Kubernetes versions (v1.28 - v1.31). @kubeflow/wg-training-leads WDYT?

What we need to do:

- Upgrade the Go library versions: https://github.com/kubeflow/training-operator/blob/6965c1a92462d46981071748936eb135a4584f3d/go.mod#L15-L22
- Upgrade the CI environment versions:
  - https://github.com/kubeflow/training-operator/blob/6965c1a92462d46981071748936eb135a4584f3d/.github/workflows/integration-tests.yaml#L29-L55
  - https://github.com/kubeflow/training-operator/blob/6965c1a92462d46981071748936eb135a4584f3d/.github/workflows/unittests.yaml#L20-L21
- Upgrade the tool versions:
  - https://github.com/kubeflow/training-operator/blob/6965c1a92462d46981071748936eb135a4584f3d/Makefile#L89
  - https://github.com/kubeflow/training-operator/blob/6965c1a92462d46981071748936eb135a4584f3d/Makefile#L125
- Switch to the new code-generator: https://github.com/kubeflow/training-operator/blob/6965c1a92462d46981071748936eb135a4584f3d/hack/update-codegen.sh

Note that we should work through the tasks above one version at a time (1.29 -> 1.30 -> 1.31) so that we can easily revert a commit if we hit any version-specific bugs or regressions.
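
As a rough illustration of the step-by-step upgrade, the sketch below prints the commands for one bump per Kubernetes minor version. The module list and the `.0` patch versions are assumptions, not the actual contents of go.mod, and `k8s_bump_commands` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumed set of k8s.io modules pinned in go.mod; the real list may differ.
K8S_MODULES=(k8s.io/api k8s.io/apimachinery k8s.io/client-go k8s.io/code-generator)

# Hypothetical helper: print the commands for one minor-version bump.
# k8s.io/* modules are tagged v0.<minor>.<patch> for Kubernetes v1.<minor>.<patch>.
k8s_bump_commands() {
  local minor="$1" module args=()
  for module in "${K8S_MODULES[@]}"; do
    args+=("${module}@v0.${minor}.0")
  done
  echo "go get ${args[*]}"
  echo "go mod tidy"
}

# One bump (and ideally one commit or PR) per minor version,
# so that a single step can be reverted on its own.
for minor in 29 30 31; do
  echo "# Kubernetes v1.${minor}"
  k8s_bump_commands "${minor}"
done
```

Printing the commands rather than running them keeps the sketch side-effect free; each printed group would presumably be followed by a test run before moving on to the next version.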

Why is this needed?

Currently, we support Kubernetes v1.27 - v1.29, but these versions have reached or will soon reach end of life: https://kubernetes.io/releases/ So, we should support newer versions.

EoL for 1.27: 2024-07-08
EoL for 1.28: 2024-10-28
EoL for 1.29: 2025-02-28
EoL for 1.30: 2025-06-28
EoL for 1.31: 2025-10-28
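
To make the v1.28 question concrete, here is a small sketch that lists which versions are still within their EoL window on a given day. The dates are the ones listed above; the helper name `supported_on` and the example date are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# EoL dates copied from the list above, as version=date pairs.
EOL_DATES=(
  "1.27=2024-07-08"
  "1.28=2024-10-28"
  "1.29=2025-02-28"
  "1.30=2025-06-28"
  "1.31=2025-10-28"
)

# Hypothetical helper: print the versions whose EoL date is on or after
# the given ISO 8601 date (YYYY-MM-DD).
supported_on() {
  local day="$1" entry version eol
  local out=()
  for entry in "${EOL_DATES[@]}"; do
    version="${entry%%=*}"
    eol="${entry#*=}"
    # ISO 8601 dates sort lexicographically, so a string comparison suffices.
    if [[ ! "${eol}" < "${day}" ]]; then
      out+=("${version}")
    fi
  done
  echo "${out[*]-}"
}

# Example date chosen for illustration only.
supported_on "2024-09-15"   # prints: 1.28 1.29 1.30 1.31
```

On the example date, v1.27 is already past its EoL while v1.28 still has a few weeks left, which is the case for supporting four versions.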

Love this feature?

Give it a 👍 We prioritize the features with most 👍

tenzen-y commented 1 month ago

/remove-label lifecycle/needs-triage

kannon92 commented 1 month ago

Could I try upgrading this?

I think I'd open a PR with 1.30 first, following your detailed plan.

tenzen-y commented 1 month ago

> Could I try upgrading this?
>
> I think I'd open a PR with 1.30 first, following your detailed plan.

Yes, we can start on v1.30 support before we decide on the range of supported versions. /assign @kannon92

kannon92 commented 1 month ago

I may need some guidance on the code generation.

I tried upgrading this and I ran into some problems with hack/update-codegen.sh.

# Notice: The code in code-generator does not generate defaulter by default.
# We need to build binary from vendor cmd folder.
#echo "Building defaulter-gen"
#go build -o defaulter-gen ${CODEGEN_PKG}/cmd/defaulter-gen

# $(go env GOPATH)/bin/defaulter-gen is automatically built from ${CODEGEN_PKG}/generate-groups.sh
echo "Generating defaulters for kubeflow.org/v1"
$(go env GOPATH)/bin/defaulter-gen --input-dirs github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1 \
    -O zz_generated.defaults \
    --output-package github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1 \
    --go-header-file hack/boilerplate/boilerplate.go.txt "$@" \
    --output-base "${TEMP_DIR}"

I built defaulter-gen, but it seems that most of these arguments no longer exist in code-generator v0.30.

v0.30 does not recognize --input-dirs, -O, --output-package, or --output-base.
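
For reference, a minimal sketch of how the same call might look against v0.30. It assumes that the v0.30 generators take the input packages as positional arguments and accept `--output-file` in place of `-O`; that should be checked against `defaulter-gen --help` for the version in use. `defaulter_gen_cmd` and `RUN_CODEGEN` are hypothetical names.

```shell
#!/usr/bin/env bash
set -euo pipefail

API_PKG="github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1"

# Hypothetical helper: print the assumed v0.30-style invocation.
# Assumptions: the package is passed positionally (replacing --input-dirs),
# --output-file takes the full file name (replacing -O), and output is written
# next to the sources, so --output-package and --output-base are dropped.
defaulter_gen_cmd() {
  local cmd=(
    defaulter-gen
    --output-file zz_generated.defaults.go
    --go-header-file hack/boilerplate/boilerplate.go.txt
    "${API_PKG}"
  )
  echo "${cmd[*]}"
}

if [[ "${RUN_CODEGEN:-0}" == "1" ]]; then
  # Word splitting is intentional here: none of the arguments contain spaces.
  $(defaulter_gen_cmd)
else
  # Dry run by default: only show what would be executed.
  defaulter_gen_cmd
fi
```

By default this only prints the command; setting `RUN_CODEGEN=1` would execute it, provided a v0.30 `defaulter-gen` binary is on the PATH.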

kannon92 commented 1 month ago

@tenzen-y let's discuss on https://github.com/kubeflow/training-operator/pull/2299.

I found it difficult to keep the existing script working, as much of it has changed with v0.30. So I created a new script that is very similar to the ones in Kueue and JobSet.

Code generation seems to work, but I am running into problems with the SDK generation.

tenzen-y commented 1 month ago

> @tenzen-y let's discuss on #2299.
>
> I found it difficult to keep the existing script working, as much of it has changed with v0.30. So I created a new script that is very similar to the ones in Kueue and JobSet.
>
> Code generation seems to work, but I am running into problems with the SDK generation.

You may be able to learn something from the mpi-operator: https://github.com/kubeflow/mpi-operator/pull/657

kannon92 commented 1 month ago

Yes. That is a great callout.