kubernetes-retired / kubefed

Kubernetes Cluster Federation
Apache License 2.0

fix: FailureThreshold and SuccessThreshold do not take effect #1497

Closed FengXingYuXin closed 2 years ago

FengXingYuXin commented 2 years ago

What this PR does / why we need it: The cluster health check's threshold configuration does not take effect.

Which issue(s) this PR fixes (optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged): Fixes #1496
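For context, the thresholds in question come from KubeFed's cluster health check settings. A minimal sketch of those settings, modeled on the `ClusterHealthCheckConfig` used by the cluster controller (the field names follow the repo, but the exact types here are an assumption):

```go
package main

import (
	"fmt"
	"time"
)

// ClusterHealthCheckConfig sketches kubefed's cluster health check
// settings. Field names mirror the config used by the cluster
// controller; exact types in the repo may differ.
type ClusterHealthCheckConfig struct {
	Period           time.Duration // how often each member cluster is probed
	FailureThreshold int64         // consecutive failed probes before Ready -> NotReady
	SuccessThreshold int64         // consecutive successful probes before NotReady -> Ready
	Timeout          time.Duration // per-probe timeout
}

func main() {
	cfg := ClusterHealthCheckConfig{
		Period:           10 * time.Second,
		FailureThreshold: 3,
		SuccessThreshold: 1,
		Timeout:          3 * time.Second,
	}
	// The symptom in #1496: these two fields were read from config but
	// not consulted when deciding whether to flip the stored status.
	fmt.Println(cfg.FailureThreshold, cfg.SuccessThreshold)
}
```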

Special notes for your reviewer:

linux-foundation-easycla[bot] commented 2 years ago

CLA Signed

The committers listed above are authorized under a signed CLA.

k8s-ci-robot commented 2 years ago

Welcome @FengXingYuXin!

It looks like this is your first PR to kubernetes-sigs/kubefed 🎉. Please refer to our pull request process documentation to help your PR have a smooth ride to approval.

You will be prompted by a bot to use commands during the review process. Do not be afraid to follow the prompts! It is okay to experiment. Here is the bot commands documentation.

You can also check if kubernetes-sigs/kubefed has its own contribution guidelines.

You may want to refer to our testing guide if you run into trouble with your tests not passing.

If you are having difficulty getting your pull request seen, please follow the recommended escalation practices. Also, for tips and tricks in the contribution process you may want to read the Kubernetes contributor cheat sheet. We want to make sure your contribution gets all the attention it needs!

Thank you, and welcome to Kubernetes. :smiley:

k8s-ci-robot commented 2 years ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: FengXingYuXin

To complete the pull request process, please assign hectorj2f after the PR has been reviewed. You can assign the PR to them by writing `/assign @hectorj2f` in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/kubernetes-sigs/kubefed/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment. Approvers can cancel approval by writing `/approve cancel` in a comment.

RainbowMango commented 2 years ago

cc @irfanurrehman @hectorj2f Could you please take a look?

irfanurrehman commented 2 years ago

@FengXingYuXin Thanks for doing this. Will it be possible for you to add some kind of a test for this change?

irfanurrehman commented 2 years ago

> @FengXingYuXin Thanks for doing this. Will it be possible for you to add some kind of a test for this change?

@FengXingYuXin nevermind, there are unit tests for this and they seem to fail with your change. Please take a look.

FengXingYuXin commented 2 years ago

> @FengXingYuXin Thanks for doing this. Will it be possible for you to add some kind of a test for this change?

> @FengXingYuXin nevermind, there are unit tests for this and they seem to fail with your change. Please take a look.

@irfanurrehman Thanks for your reply and the reminder. I have fixed the unit test cases; please check them at your convenience.

FengXingYuXin commented 2 years ago

@irfanurrehman About the moment the status transitions: for example, if the failure threshold is 3, as I understand it the status should transition from Ready to NotReady on the third consecutive failed probe, but the existing code only transitions it on the 4th. If you agree, I can adjust it later.

irfanurrehman commented 2 years ago

> @irfanurrehman About the moment the status transitions: for example, if the failure threshold is 3, as I understand it the status should transition from Ready to NotReady on the third consecutive failed probe, but the existing code only transitions it on the 4th. If you agree, I can adjust it later.

@FengXingYuXin Apologies for the late reply; I had a chance to look at your changes this weekend. Thanks for them. I find the change a little unclean, though, and I agree that this portion of the code could do with an overhaul: the logic in thresholdAdjustedClusterStatus() could be rewritten to make it simpler. To fix only the issue you raised, I implemented a quick fix which seems to work with the existing test cases unchanged, and I added a new test case to cover your issue. If that seems fine to you, please pull the changes from here into your PR.

If, however, you are interested in rewriting the logic to make it easier to understand and more maintainable, I recommend the following: keep the current ClusterData.clusterStatus field as is, and use it for the last sampling data, updated each time.

Introduce a new field in cluster data:

    // clusterStatus of the last observed transition.
    transitionStatus *fedv1b1.KubeFedClusterStatus

and use that to store the observed transition when it is observed for the first time. A recommended starting point for the code is below (you will need to complete the logic and update the tests accordingly):


    if storedData.clusterStatus == nil {
        storedData.resultRun = 1
        return clusterStatus
    }

    threshold := clusterHealthCheckConfig.FailureThreshold
    if util.IsClusterReady(clusterStatus) {
        threshold = clusterHealthCheckConfig.SuccessThreshold
    }

    if !clusterStatusEqual(clusterStatus, storedData.clusterStatus) {
        // We observe a transition
        if storedData.transitionStatus == nil {
            // This is the first time we observe the transition
            storedData.transitionStatus = clusterStatus
            storedData.resultRun = 1
        }
        if storedData.resultRun < threshold {
            // Success/Failure is below threshold - leave the probe state unchanged.
            probeTime := clusterStatus.Conditions[0].LastProbeTime
            clusterStatus = storedData.clusterStatus
            setProbeTime(clusterStatus, probeTime)
            if storedData.transitionStatus != nil {
                storedData.resultRun++
            }
        }
    } else {
        storedData.resultRun++
    }

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle stale`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Close this issue or PR with `/close`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

k8s-triage-robot commented 2 years ago

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

- After 90d of inactivity, `lifecycle/stale` is applied
- After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
- After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed

You can:

- Reopen this issue or PR with `/reopen`
- Mark this issue or PR as fresh with `/remove-lifecycle rotten`
- Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

k8s-ci-robot commented 2 years ago

@k8s-triage-robot: Closed this PR.

In response to [this](https://github.com/kubernetes-sigs/kubefed/pull/1497#issuecomment-1240774006):

> The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.
>
> This bot triages issues and PRs according to the following rules:
> - After 90d of inactivity, `lifecycle/stale` is applied
> - After 30d of inactivity since `lifecycle/stale` was applied, `lifecycle/rotten` is applied
> - After 30d of inactivity since `lifecycle/rotten` was applied, the issue is closed
>
> You can:
> - Reopen this issue or PR with `/reopen`
> - Mark this issue or PR as fresh with `/remove-lifecycle rotten`
> - Offer to help out with [Issue Triage](https://www.kubernetes.dev/docs/guide/issue-triage/)
>
> Please send feedback to sig-contributor-experience at [kubernetes/community](https://github.com/kubernetes/community).
>
> /close

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.