kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.62k stars, 701 forks

[Bug] Finish CleanupJob early if the job is suspended. #2243

Closed. mszadkow closed this 3 months ago

mszadkow commented 3 months ago

What this PR does / why we need it: Fixes a bug that occurs when the job is suspended and `runPolicy.ttlSecondsAfterFinished` is set. In that situation, CleanupJob returned an error and the active status of the replicas could not be removed.

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #2239

mszadkow commented 3 months ago

cc @tenzen-y

mszadkow commented 3 months ago

cc @alculquicondor

google-oss-prow[bot] commented 3 months ago

@mszadkow: Cannot trigger testing until a trusted user reviews the PR and leaves an /ok-to-test message.

In response to [this](https://github.com/kubeflow/training-operator/pull/2243#issuecomment-2318385420):

> /ok-to-test

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.

coveralls commented 3 months ago

Pull Request Test Coverage Report for Build 10631502623

Details


| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| pkg/controller.v1/common/job.go | 0 | 1 | 0.0% |
| Total | 0 | 1 | 0.0% |

| Files with Coverage Reduction | New Missed Lines | % |
|---|---|---|
| pkg/controller.v1/mpi/mpijob.go | 1 | 91.06% |
| Total | 1 | |

| Totals | Coverage Status |
|---|---|
| Change from base Build 10600609425 | 0.06% |
| Covered Lines | 3950 |
| Relevant Lines | 12421 |

💛 - Coveralls

google-oss-prow[bot] commented 3 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

- ~~[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.