kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0

KEP-2170: Add TrainJob conditions #2322

Closed tenzen-y closed 2 weeks ago

tenzen-y commented 2 weeks ago

What this PR does / why we need it: This PR implements the TrainJob condition mechanism based on the state transitions described in the KEP-2170 proposal: https://github.com/kubeflow/training-operator/tree/master/docs/proposals/2170-kubeflow-training-v2#state-transition

However, the current implementation derives the TrainJob conditions from the JobSet status.conditions rather than from status.terminalState. The terminalState field was introduced in JobSet v0.6, and that JobSet release depends on newer K8s libraries than the training-operator currently uses.

Once the K8s libraries are upgraded to 1.30 in https://github.com/kubeflow/training-operator/pull/2299, we can revisit this and rely on the JobSet status.terminalState instead.
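For readers unfamiliar with the approach, here is a minimal sketch of deriving a terminal TrainJob condition from JobSet conditions. It is not the code in this PR. The `Condition` struct is a simplified stand-in for `metav1.Condition`, `terminalCondition` is a hypothetical helper, and the condition type strings are assumptions used for illustration only.

```go
package main

import "fmt"

// Condition is a simplified stand-in for metav1.Condition.
// Only the fields needed for this illustration are included.
type Condition struct {
	Type    string
	Status  string // "True", "False", or "Unknown"
	Reason  string
	Message string
}

// Condition type names are assumptions for this sketch.
const (
	jobSetCompleted  = "Completed"
	jobSetFailed     = "Failed"
	trainJobComplete = "Complete"
	trainJobFailed   = "Failed"
)

// terminalCondition (hypothetical helper) scans the JobSet conditions and
// returns the matching terminal TrainJob condition, if the JobSet has one
// with Status "True". Reason and Message are copied through unchanged.
func terminalCondition(jobSetConds []Condition) (Condition, bool) {
	for _, c := range jobSetConds {
		if c.Status != "True" {
			continue
		}
		switch c.Type {
		case jobSetCompleted:
			return Condition{Type: trainJobComplete, Status: "True", Reason: c.Reason, Message: c.Message}, true
		case jobSetFailed:
			return Condition{Type: trainJobFailed, Status: "True", Reason: c.Reason, Message: c.Message}, true
		}
	}
	// No terminal condition yet: the TrainJob is still running or suspended.
	return Condition{}, false
}

func main() {
	// Example input; the reasons and messages are made up for the demo.
	conds := []Condition{
		{Type: "Suspended", Status: "False", Reason: "Resumed", Message: "jobset is resumed"},
		{Type: jobSetCompleted, Status: "True", Reason: "AllJobsCompleted", Message: "jobset completed"},
	}
	if c, ok := terminalCondition(conds); ok {
		fmt.Printf("%s=%s (%s)\n", c.Type, c.Status, c.Reason)
	}
}
```

With status.terminalState available, the loop over conditions could presumably be replaced by a single field check, which is the follow-up mentioned above.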

Which issue(s) this PR fixes:
Part-of: https://github.com/kubeflow/training-operator/issues/2207
Relates to #2170

coveralls commented 2 weeks ago

Pull Request Test Coverage Report for Build 11754225694

Totals Coverage Status
Change from base Build 11663764609: 0.0%
Covered Lines: 77
Relevant Lines: 77

tenzen-y commented 2 weeks ago

/hold for review

tenzen-y commented 2 weeks ago

/assign @kubeflow/wg-training-leads

tenzen-y commented 2 weeks ago

@andreyvelich I addressed all comments. PTAL, thanks!

andreyvelich commented 2 weeks ago

Thanks @tenzen-y!
/lgtm
/approve
/hold

Feel free to merge it.

google-oss-prow[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [andreyvelich]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

tenzen-y commented 2 weeks ago

Thank you for the review!
/hold cancel