kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.61k stars 700 forks source link

KEP-2170: Add TrainJob and TrainingRuntime APIs #2223

Closed andreyvelich closed 2 months ago

andreyvelich commented 3 months ago

Fixes: https://github.com/kubeflow/training-operator/issues/2206

I added APIs for TrainJob, TrainingRuntime, and ClusterTrainingRuntime resources.

/assign @kubeflow/wg-training-leads @kannon92 @mimowo @vsoch @ahg-g @kuizhiqing @alculquicondor @zw0610 @franciscojavierarceo @shravan-achar

/hold for review.

ahg-g commented 3 months ago

@danielvegamyhre

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10566761788

Details


Changes Missing Coverage Covered Lines Changed/Added Lines %
pkg/apis/kubeflow.org/v2alpha1/trainingruntime_types.go 0 3 0.0%
pkg/apis/kubeflow.org/v2alpha1/trainjob_types.go 0 3 0.0%
pkg/apis/kubeflow.org/v2alpha1/zz_generated.deepcopy.go 0 616 0.0%
<!-- Total: 0 622 0.0% -->
Files with Coverage Reduction New Missed Lines %
pkg/controller.v1/mpi/mpijob.go 1 91.06%
<!-- Total: 1 -->
Totals Coverage Status
Change from base Build 10512072223: -1.7%
Covered Lines: 3950
Relevant Lines: 12421

💛 - Coveralls
tenzen-y commented 3 months ago

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools?

Unless those functions, we can not use the API in the controllers/webhooks.

andreyvelich commented 3 months ago

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools?

Unless those functions, we can not use the API in the controllers/webhooks.

I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs. Does it look good @tenzen-y ?

tenzen-y commented 3 months ago

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools? Unless those functions, we can not use the API in the controllers/webhooks.

I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs. Does it look good @tenzen-y ?

That sounds good to me. We can create a separate issue "KEP-2170: Provide client-go library for TrainJob and TrainingRuntime".

tenzen-y commented 3 months ago

@andreyvelich First of all, could you generate / createregister.go, deepcopygen and so on by controller-tools? Unless those functions, we can not use the API in the controllers/webhooks.

I registered APIs with scheme and added deepcopygen via controller-gen. I think, we can add the defaulters and other parameters required for clients, listers, informers, in the following PRs. Does it look good @tenzen-y ?

That sounds good to me. We can create a separate issue "KEP-2170: Provide client-go library for TrainJob and TrainingRuntime".

Created: https://github.com/kubeflow/training-operator/issues/2224

andreyvelich commented 2 months ago

Are there any other comments before we can merge this PR and start working on the controller implementation ?

/assign @shravan-achar @tenzen-y @kannon92 @kuizhiqing @terrytangyuan @johnugeorge /hold cancel

google-oss-prow[bot] commented 2 months ago

@andreyvelich: GitHub didn't allow me to assign the following users: shravan-achar, kannon92.

Note that only kubeflow members with read permissions, repo collaborators and people who have commented on this issue/PR can be assigned. Additionally, issues/PRs can only have 10 assignees at the same time. For more information please see the contributor guide

In response to [this](https://github.com/kubeflow/training-operator/pull/2223#issuecomment-2302121162): >Are there any other comments before we can merge this PR and start working on the controller implementation ? > >/assign @shravan-achar @tenzen-y @kannon92 @kuizhiqing @terrytangyuan @johnugeorge >/hold cancel Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
kannon92 commented 2 months ago

LGTM on my end.

andreyvelich commented 2 months ago

We had some discussions with @tenzen-y about APIs and we proposed the following changes:

andreyvelich commented 2 months ago

/rerun-all

andreyvelich commented 2 months ago

We made the final changes with @tenzen-y:

If we don't have any followup suggestions, we can merge it.

google-oss-prow[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [tenzen-y] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
tenzen-y commented 2 months ago

/hold

tenzen-y commented 2 months ago

/lgtm