kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars, 700 forks

Upgrade 1.30 #2299

Closed. kannon92 closed this 4 hours ago.

kannon92 commented 1 month ago

What this PR does / why we need it:
Upgrade Kubernetes to 1.30

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
Fixes #
Partially fix #2291

Checklist:

google-oss-prow[bot] commented 1 month ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign tenzen-y for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

kannon92 commented 1 month ago

@tenzen-y

I am having trouble with the Python SDK generation.

Downloading the swagger-codegen JAR package ...
Internal error: Unexpected relative path: 'hack/python-sdk/..'
kannon92 commented 1 month ago

@tenzen-y @andreyvelich can I get approval to run the tests?

andreyvelich commented 1 month ago

/ok-to-test
/rerun-all

kannon92 commented 1 month ago

/hold

This is still in progress.

kannon92 commented 1 month ago

/cc @tenzen-y

tenzen-y commented 1 month ago

He does not have enough time this week, so he will try to fix some of the errors next week.

kannon92 commented 4 weeks ago

@tenzen-y here is where I left off.

For JAXJob, I made some progress on the generics, but I am stuck on the predicate functions.

I am getting the following error:

cannot use r.onOwnerCreateFunc() (value of type func(event.TypedCreateEvent[client.Object]) bool) as func(event.TypedCreateEvent[*"github.com/kubeflow/training-operator/pkg/apis/kubeflow.org/v1".JAXJob]) bool value in struct literal

I was able to resolve the handler, but every use of the predicates is causing build errors.

    // using onOwnerCreateFunc is easier to set defaults
    if err = c.Watch(source.Kind(mgr.GetCache(), &kubeflowv1.JAXJob{}, &handler.TypedEnqueueRequestForObject[*kubeflowv1.JAXJob]{},
        predicate.TypedFuncs[*kubeflowv1.JAXJob]{CreateFunc: r.onOwnerCreateFunc()})); err != nil {
        return err
    }

    predicates := predicate.Funcs{
        CreateFunc: util.OnDependentCreateFunc(r.Expectations),
        UpdateFunc: util.OnDependentUpdateFunc(&r.JobController),
        DeleteFunc: util.OnDependentDeleteFunc(r.Expectations),
    }
    // Create generic predicates
    genericPredicates := predicate.Funcs{
        CreateFunc: util.OnDependentCreateFuncGeneric(r.Expectations),
        UpdateFunc: util.OnDependentUpdateFuncGeneric(&r.JobController),
        DeleteFunc: util.OnDependentDeleteFuncGeneric(r.Expectations),
    }
    // inject watching for job related pod
    if err = c.Watch(source.Kind(mgr.GetCache(), &corev1.Pod{}, &handler.TypedEnqueueRequestForObject[*corev1.Pod]{}, predicates)); err != nil {
        return err
    }
    // inject watching for job related service
    if err = c.Watch(source.Kind(mgr.GetCache(), &corev1.Service{}, &handler.TypedEnqueueRequestForObject[*corev1.Service]{}, predicates)); err != nil {
        return err
    }
    // skip watching volcano PodGroup if volcano PodGroup is not installed
    if _, err = mgr.GetRESTMapper().RESTMapping(schema.GroupKind{Group: v1beta1.GroupName, Kind: "PodGroup"},
        v1beta1.SchemeGroupVersion.Version); err == nil {
        // inject watching for job related volcano PodGroup
        if err = c.Watch(source.Kind(mgr.GetCache(), &v1beta1.PodGroup{}, &handler.TypedEnqueueRequestForObject[*v1beta1.PodGroup]{}, genericPredicates)); err != nil {
            return err
        }
    }

I am not sure of the path forward for this, as generics are not my strong suit.
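
For readers following the error above: Go generic types are invariant, so a predicate typed on `client.Object` is a different type from one typed on `*kubeflowv1.JAXJob`, even though `*JAXJob` implements the interface. Below is a minimal, standard-library-only sketch of one possible way around that, using a small adapter. Everything in it is a hypothetical stand-in: `Object`, `JAXJob`, `TypedCreateEvent`, and `TypedFuncs` only imitate the shape of the controller-runtime and Kubeflow types named in the error, and `AdaptCreate` and `onOwnerCreate` are illustrative helpers, not functions from either project.

```go
package main

import "fmt"

// Object is a hypothetical stand-in for client.Object.
type Object interface {
	GetName() string
}

// JAXJob is a hypothetical stand-in for kubeflowv1.JAXJob.
type JAXJob struct{ Name string }

func (j *JAXJob) GetName() string { return j.Name }

// TypedCreateEvent imitates the shape of event.TypedCreateEvent[T].
type TypedCreateEvent[T any] struct{ Object T }

// TypedFuncs imitates the shape of predicate.TypedFuncs[T] (create only).
type TypedFuncs[T any] struct {
	CreateFunc func(TypedCreateEvent[T]) bool
}

// onOwnerCreate is written against the Object interface, in the same way
// the error message describes r.onOwnerCreateFunc().
func onOwnerCreate() func(TypedCreateEvent[Object]) bool {
	return func(e TypedCreateEvent[Object]) bool {
		return e.Object.GetName() != ""
	}
}

// AdaptCreate wraps an Object-typed create predicate so that it can be
// stored in a field that expects a predicate typed on the concrete T.
func AdaptCreate[T Object](f func(TypedCreateEvent[Object]) bool) func(TypedCreateEvent[T]) bool {
	return func(e TypedCreateEvent[T]) bool {
		// Re-wrap the concrete object in an interface-typed event.
		return f(TypedCreateEvent[Object]{Object: e.Object})
	}
}

func main() {
	// TypedFuncs[*JAXJob]{CreateFunc: onOwnerCreate()} would not compile:
	// the two function types differ in their type argument.
	p := TypedFuncs[*JAXJob]{CreateFunc: AdaptCreate[*JAXJob](onOwnerCreate())}

	fmt.Println(p.CreateFunc(TypedCreateEvent[*JAXJob]{Object: &JAXJob{Name: "demo"}}))
	fmt.Println(p.CreateFunc(TypedCreateEvent[*JAXJob]{Object: &JAXJob{}}))
}
```

The alternative is to rewrite each predicate so it is typed on the concrete object from the start. The adapter keeps the existing interface-based predicates unchanged at the cost of one extra wrapper per watch; whether that trade-off suits this code base is left open here.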

tenzen-y commented 4 weeks ago

Thank you for letting me know. I will try to investigate how we can migrate to new functions.

kannon92 commented 4 hours ago

I wasn't able to focus on this as much as I thought I would. https://github.com/kubeflow/training-operator/pull/2332 is open.

/close