kubeflow / training-operator

Distributed ML Training and Fine-Tuning on Kubernetes
https://www.kubeflow.org/docs/components/training
Apache License 2.0
1.58k stars 687 forks

fix: incorrect initialize null replicaStatuses lead to update JobStat… #2190

Closed PeterChg closed 2 months ago

PeterChg commented 2 months ago

…us in ApiServer failed

What this PR does / why we need it:

The following error may occur when calling the UpdateJobStatusInApiServer function. It causes a large number of repeated job reconcile retries:

ERROR Reconciler error {"controller": "pytorchjob-controller", "object": {"name":"test-0717","namespace":"ns-test"}, "namespace": "ns-test", "name": "tj-test-fusion-0717", "reconcileID": "2fe0485f-1a89-46d2-bf50-81eeadbd979f", "error": "PyTorchJob.kubeflow.org \"tj-test-fusion-0717\" is invalid: status.replicaStatuses: Required value"}

Changing how replicaStatuses is initialized resolves this problem.
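
To illustrate why the initialization matters, here is a minimal Go sketch of how a nil map and an empty map differ once serialized. It shows one way a `Required value` validation error like the one above can arise. The types `jobStatus` and `replicaStatus` and the helper `encode` are hypothetical stand-ins, not the operator's actual definitions, and the JSON tag without `omitempty` is an assumption made for the example.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// replicaStatus and jobStatus are hypothetical stand-ins for illustration,
// not the training operator's real types.
type replicaStatus struct {
	Active int32 `json:"active,omitempty"`
}

type jobStatus struct {
	// Assumption: no omitempty, so a nil map is emitted as JSON null.
	ReplicaStatuses map[string]*replicaStatus `json:"replicaStatuses"`
}

// encode returns the JSON form of a status and panics on an unexpected error.
func encode(s jobStatus) string {
	b, err := json.Marshal(s)
	if err != nil {
		panic(err)
	}
	return string(b)
}

func main() {
	var nilStatus jobStatus // ReplicaStatuses is a nil map
	emptyStatus := jobStatus{ReplicaStatuses: map[string]*replicaStatus{}}

	fmt.Println(encode(nilStatus))   // {"replicaStatuses":null}
	fmt.Println(encode(emptyStatus)) // {"replicaStatuses":{}}
}
```

A schema that marks the field as required would typically reject the `null` form while accepting the empty object, which is consistent with the error message shown above.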

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged): Fixes #


google-oss-prow[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:

Once this PR has been reviewed and has the lgtm label, please assign gaocegege for approval. For more information see the Kubernetes Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubeflow/training-operator/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

coveralls commented 2 months ago

Pull Request Test Coverage Report for Build 10163315038

Details


| Totals | Coverage Status |
| :-- | --: |
| Change from base Build 10131041132: | 0.07% |
| Covered Lines: | 3946 |
| Relevant Lines: | 11301 |

💛 - Coveralls

PeterChg commented 2 months ago

Why does the integration test fail? The failure seems unrelated to the code change. How do I re-launch the test?

/cc gaocegege

PeterChg commented 2 months ago

/rerun-all