OCPBUGS-41778: increase kube-apiserver failureThreshold

liouk commented 2 months ago

This PR improves the liveness/readiness checks of the oauth-apiserver.

In particular:

- replace liveness probe path `healthz` with `livez`, and exclude `etcd` from the probe
- increase the readiness probe failure threshold to 3
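
As a rough illustration of the first item, the sketch below builds a health-check path with individual checks excluded. It relies on the `exclude` query parameter that upstream Kubernetes API server health endpoints accept; whether this PR produces exactly this path string is an assumption, and `health_path` is a hypothetical helper, not a function from the operator.

```python
from urllib.parse import urlencode


def health_path(endpoint, excluded_checks=()):
    """Build a health endpoint path, optionally excluding named checks.

    `endpoint` is e.g. "livez" or "readyz"; each excluded check becomes
    one `exclude=<name>` query parameter. Hypothetical helper for illustration.
    """
    query = urlencode([("exclude", check) for check in excluded_checks])
    return f"/{endpoint}?{query}" if query else f"/{endpoint}"


# Liveness probe path with the etcd check left out, as described above.
liveness_path = health_path("livez", ["etcd"])
print(liveness_path)
```

Excluding `etcd` from liveness means a slow or unavailable etcd does not by itself cause the kubelet to restart the API server container, while readiness can still reflect etcd health.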

Note that there is no need to update shutdown-delay-duration and terminationGracePeriodSeconds as their values are already large enough to account for the increase in the readiness probe failure threshold (see the defaultconfig.yaml and the targetconfigcontroller.go).
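
The reasoning in the note above can be sketched as a small check: the shutdown delay should be at least as long as the time the kubelet may need to mark the pod unready. All numbers below are hypothetical placeholders, not the values from `defaultconfig.yaml` or `targetconfigcontroller.go`, and the formula is only a rough upper bound (wait for the next probe, then the remaining failures, plus one probe timeout).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Probe:
    period_seconds: int
    timeout_seconds: int
    failure_threshold: int


def worst_case_detection_seconds(probe):
    """Approximate upper bound on the time until the probe is reported failed.

    failure_threshold consecutive failures are needed, each one period apart,
    and the last probe may run for up to timeout_seconds before failing.
    """
    return probe.failure_threshold * probe.period_seconds + probe.timeout_seconds


def shutdown_delay_is_sufficient(probe, shutdown_delay_seconds):
    """True if the shutdown delay covers the worst-case readiness detection time."""
    return shutdown_delay_seconds >= worst_case_detection_seconds(probe)


# Hypothetical values for illustration only.
readiness = Probe(period_seconds=5, timeout_seconds=10, failure_threshold=3)
print(worst_case_detection_seconds(readiness))
print(shutdown_delay_is_sufficient(readiness, shutdown_delay_seconds=70))
```

With a check of this shape, raising `failureThreshold` only requires a change to the shutdown settings when the detection window grows past the configured delay.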

openshift-ci-robot commented 2 months ago

@liouk: This pull request references Jira Issue OCPBUGS-41778, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @xingxingxia

The bug has been updated to refer to the pull request using the external bug tracker.

In response to [this](https://github.com/openshift/cluster-kube-apiserver-operator/pull/1732):

> This PR improves the liveness/readiness checks of the oauth-apiserver.
>
> In particular:
> - set `etcd-healthcheck-timeout` and `etcd-readycheck-timeout` to 1sec less than the respective probe timeouts
> - replace liveness probe path `healthz` with `livez`, and exclude `etcd` from the probe
> - increase the readiness probe failure threshold to 3
>
> Note that there is no need to update `shutdown-delay-duration` and `terminationGracePeriodSeconds` as their values are already large enough to account for the increase in the readiness probe failure threshold (see the [defaultconfig.yaml](https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/bindata/assets/config/defaultconfig.yaml#L167-L168) and the [targetconfigcontroller.go](https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/pkg/operator/targetconfigcontroller/targetconfigcontroller.go#L432-L436)).

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-kube-apiserver-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci-robot commented 2 months ago

@liouk: This pull request references Jira Issue OCPBUGS-41778, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.18.0) matches configured target version for branch (4.18.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @xingxingxia


p0lyn0mial commented 2 months ago

/lgtm

Holding off to give others time to review.

/hold

benluddy commented 2 months ago

/lgtm

p0lyn0mial commented 2 months ago

/hold cancel

/lgtm

openshift-ci[bot] commented 2 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: benluddy, liouk, p0lyn0mial


Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-kube-apiserver-operator/blob/master/OWNERS)~~ [benluddy,p0lyn0mial]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

openshift-ci-robot commented 2 months ago

/retest-required

Remaining retests: 0 against base HEAD c731008f29e6ad878389b976d5305fc8a63a6f31 and 2 for PR HEAD 13faa5faca10bb8b3501d90da071248692e932bb in total

p0lyn0mial commented 2 months ago

@deads2k FYI, ci/prow/e2e-aws-ovn-upgrade failed on the following (might be a bug in the test):

missing acquiring stage for namespace/openshift-machine-api node/ip-10-0-90-109.ec2.internal pod/control-plane-machine-set-operator-7dc4465f5c-zz6pd uid/5704ff7b-ac83-46b1-a07c-f54a02bce1a5 container/control-plane-machine-set-operator}

p0lyn0mial commented 2 months ago

/retest-required

p0lyn0mial commented 2 months ago

@deads2k ci/prow/e2e-aws-ovn-upgrade has failed for the second time in a row on `missing acquiring stage`, this time for namespace/openshift-cluster-storage-operator. It might be an issue with the test. Given how late we are in the release cycle, and that the failure is not caused by the changes introduced in this PR, are you willing to manually override the failing job?

deads2k commented 2 months ago

/override ci/prow/e2e-aws-ovn-upgrade

openshift-ci[bot] commented 2 months ago

@deads2k: Overrode contexts on behalf of deads2k: ci/prow/e2e-aws-ovn-upgrade

In response to [this](https://github.com/openshift/cluster-kube-apiserver-operator/pull/1732#issuecomment-2346451124):

> /override ci/prow/e2e-aws-ovn-upgrade

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

openshift-ci[bot] commented 2 months ago

@liouk: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-gcp-operator-single-node | 13faa5faca10bb8b3501d90da071248692e932bb | link | false | `/test e2e-gcp-operator-single-node` |
| ci/prow/e2e-aws-operator-disruptive-single-node | 13faa5faca10bb8b3501d90da071248692e932bb | link | false | `/test e2e-aws-operator-disruptive-single-node` |



openshift-ci-robot commented 2 months ago

@liouk: Jira Issue OCPBUGS-41778: All pull requests linked via external trackers have merged:

openshift/cluster-kube-apiserver-operator#1732

Jira Issue OCPBUGS-41778 has been moved to the MODIFIED state.


p0lyn0mial commented 2 months ago

/cherry-pick release-4.17

openshift-cherrypick-robot commented 2 months ago

@p0lyn0mial: new pull request created: #1733

In response to [this](https://github.com/openshift/cluster-kube-apiserver-operator/pull/1732#issuecomment-2346469699):

> /cherry-pick release-4.17

openshift-bot commented 2 months ago

[ART PR BUILD NOTIFIER]

Distgit: ose-cluster-kube-apiserver-operator

This PR has been included in build ose-cluster-kube-apiserver-operator-container-v4.18.0-202409121712.p0.g49d13e8.assembly.stream.el9. All builds following this will include this PR.
