kubeflow / katib

Automated Machine Learning on Kubernetes
https://www.kubeflow.org/docs/components/katib
Apache License 2.0

Fix TestReconcileBatchJob #2350

Closed forsaken628 closed 2 weeks ago

forsaken628 commented 2 weeks ago

What this PR does / why we need it: Fix unstable tests

Which issue(s) this PR fixes _(optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged)_: 1649

tenzen-y commented 2 weeks ago

@forsaken628 Thank you for this contribution. Could you explain in the PR description why the flakiness happened and how you fixed the root cause?

forsaken628 commented 2 weeks ago

With the previous logic, calling GetTrialObservationLog returns observationLogAvailable once and then observationLogUnavailable on every later call.

I investigated the cause of the failure in the CI environment: in test 2, which should have returned observationLogAvailable, it actually returned observationLogUnavailable, indicating that GetTrialObservationLog had been called one extra time somewhere by accident.

Instead of locating where the extra call occurred, I rewrote mockManagerClient to always return the same reply until DeleteTrialObservationLog is called, which I think is consistent with the semantics of ManagerClient.
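
For readers following along, the difference between the two mock behaviours can be sketched roughly as below. This is a simplified illustration, not the code from this PR: the type names `oneShotMock` and `stableMock`, the string-based `reply` type, and the method signatures are all hypothetical, and the behaviour after `DeleteTrialObservationLog` (reporting the log as unavailable) is an assumption.

```go
package main

import "fmt"

// reply is a hypothetical stand-in for the reply messages used in the real tests.
type reply string

const (
	observationLogAvailable   reply = "available"
	observationLogUnavailable reply = "unavailable"
)

// oneShotMock sketches the previous behaviour: the first call reports the
// log as available, every later call reports it as unavailable. One stray
// call anywhere shifts the result seen by the test.
type oneShotMock struct {
	calls int
}

func (m *oneShotMock) GetTrialObservationLog() reply {
	m.calls++
	if m.calls == 1 {
		return observationLogAvailable
	}
	return observationLogUnavailable
}

// stableMock sketches the rewritten behaviour: the same reply is returned on
// every call until DeleteTrialObservationLog is called.
type stableMock struct {
	log     reply
	deleted bool
}

func (m *stableMock) GetTrialObservationLog() reply {
	if m.deleted {
		// Assumption: after deletion the log is reported as unavailable.
		return observationLogUnavailable
	}
	return m.log
}

func (m *stableMock) DeleteTrialObservationLog() {
	m.deleted = true
}

func main() {
	old := &oneShotMock{}
	old.GetTrialObservationLog() // a stray call consumes the only "available" reply
	fmt.Println("one-shot mock after stray call:", old.GetTrialObservationLog())

	stable := &stableMock{log: observationLogAvailable}
	stable.GetTrialObservationLog() // a stray call has no effect
	fmt.Println("stable mock after stray call:", stable.GetTrialObservationLog())
}
```

With a mock of the second kind, the number of times the reconciler happens to read the observation log no longer affects the test outcome.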

tenzen-y commented 2 weeks ago

Thank you for clarifying the root cause. It seems that you split the combined test case into dedicated cases based on the result of GetTrialObservationLog. That sounds reasonable.

google-oss-prow[bot] commented 2 weeks ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: tenzen-y

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/kubeflow/katib/blob/master/OWNERS)~~ [tenzen-y]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

andreyvelich commented 2 weeks ago

Thank you for this amazing contribution @forsaken628! I really hope it will improve our unstable tests.