kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Migrate the container resource calculation mechanism to k/k library #2280

Open tenzen-y opened 2 weeks ago

tenzen-y commented 2 weeks ago

What you would like to be added?

Currently, we depend on Kueue's container resource request calculation mechanism, because the k/k mechanism is not exposed to third-party repositories.

https://github.com/kubeflow/training-operator/blob/22da8af373f18e8e51c1e466a35a7738340b8c7d/pkg/runtime.v2/runtime.go#L124-L125

However, starting with k/k v1.32, the container resource computation library can be used from any repository: https://github.com/kubernetes/kubernetes/pull/124609

Why is this needed?

This would remove our dependency on Kueue for this calculation; fewer dependencies would be better.

kannon92 commented 2 weeks ago

How would this work with version skew?

Do we migrate to this once 1.32 is the last version supported?

tenzen-y commented 2 weeks ago

> How would this work with version skew? Do we migrate to this once 1.32 is the last version supported?

I will switch to the k/k library once 1.32.0 is released.

However, we can only adopt 1.32.0 after the next Kubeflow release, since v1.32 is out of scope for that release.

kannon92 commented 2 weeks ago

Sorry, I am a bit confused.

Are you saying that if I am running k8s 1.31, kubeflow (once upgraded to v1.32) is not supported in this case?

I can install kubeflow on various versions of kubernetes so I was thinking that if 1.29, 1.30 and 1.31 are still in support, do we need to make sure that Kubeflow can be installed/functional on those?

I guess for v2, we could say that this is only supported on k8s 1.32 and later.

tenzen-y commented 2 weeks ago

As you can see here: https://www.kubeflow.org/docs/releases/kubeflow-1.9/ the supported Kubernetes versions are defined by the release team and, judging by previous Kubeflow releases, they often lag a little behind the latest Kubernetes version.

So, I guess that the next supported versions will be 1.29 - 1.31.

tenzen-y commented 2 weeks ago

> I guess for v2, we could say that this only supported on k8s 1.32 and on.

I think this would be quite challenging, since we would need to prepare additional verification infrastructure, Go modules, and more. Additionally, if we wanted to do it, we would need to offer different supported versions to Kubeflow vendors (https://www.kubeflow.org/docs/started/installing-kubeflow/#packaged-distributions).

andreyvelich commented 1 week ago

/remove-label lifecycle/needs-triage
/kind cleanup