OCPBUGS-32510: change metrics-server probes for SNO

simonpasquier commented 5 months ago

This change switches the metrics-server's readiness probe to use the /livez endpoint instead of /readyz for single-node deployments.

By default, the /readyz endpoint is used to assert the component's readiness. This endpoint returns success once the metrics-server has metric samples over 2 intervals (e.g. it has scraped at least one kubelet twice).
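
As a rough illustration of that readiness condition (this is not the metrics-server's actual implementation, and the class and method names are hypothetical), a tracker could report ready once any node has been scraped successfully at least twice:

```python
from collections import defaultdict


class ScrapeTracker:
    """Hypothetical model of the /readyz condition described above."""

    def __init__(self):
        # Number of successful scrapes recorded per node (kubelet).
        self._scrapes = defaultdict(int)

    def record_scrape(self, node):
        """Record one successful scrape of the given node's kubelet."""
        self._scrapes[node] += 1

    def ready(self):
        """Ready once at least one node has two or more samples."""
        return any(count >= 2 for count in self._scrapes.values())
```

In this model, a cluster with a single kubelet has no other node to fall back on, so one slow scrape directly postpones the moment `ready()` becomes true.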

In single-node deployments, the kubelet sometimes fails to respond in a timely fashion (especially in end-to-end tests) due to contention in cAdvisor, which delays readiness and causes test failures. To work around the issue, we use the /livez endpoint in this mode.
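
A minimal sketch of the selection logic, assuming a boolean that says whether the cluster is single-node. The helper names are hypothetical, and the real operator works on Kubernetes objects rather than plain dicts; the `port` and `scheme` values are taken from the pod spec shown later in this thread:

```python
def readiness_probe_path(single_node):
    """Return the HTTP path for the metrics-server readiness probe.

    Hypothetical helper, for illustration only.
    """
    # Per the PR description, /livez is used in single-node mode so that
    # slow kubelet responses do not delay readiness.
    return "/livez" if single_node else "/readyz"


def readiness_probe(single_node):
    """Build a readinessProbe-shaped dict with the selected path."""
    return {
        "httpGet": {
            "path": readiness_probe_path(single_node),
            "port": "https",
            "scheme": "HTTPS",
        }
    }
```

Calling `readiness_probe(True)` yields a probe that targets `/livez`, while `readiness_probe(False)` keeps the default `/readyz`.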

The long-term plan is to switch resource metrics from cAdvisor to the CRI stats API (currently an alpha feature). Once that happens, we can remove this change.

[ ] I added CHANGELOG entry for this change.
[X] No user facing changes, so no entry in CHANGELOG was needed.

openshift-ci-robot commented 5 months ago

@simonpasquier: This pull request references Jira Issue OCPBUGS-32510, which is valid.

3 validation(s) were run on this bug

* bug is open, matching expected state (open)
* bug target version (4.16.0) matches configured target version for branch (4.16.0)
* bug is in the state POST, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact: /cc @juzhao

The bug has been updated to refer to the pull request using the external bug tracker.

simonpasquier commented 5 months ago

/assign @machine424
/assign @jan--f
/assign @slashpai

slashpai commented 5 months ago

/retest

simonpasquier commented 5 months ago

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node https://github.com/openshift/api/pull/1878

openshift-ci[bot] commented 5 months ago

@simonpasquier: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

machine424 commented 5 months ago

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci[bot] commented 5 months ago

@machine424: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/a33270c0-0bb6-11ef-9260-f04f1bc5bec1-0

slashpai commented 5 months ago

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci[bot] commented 5 months ago

@slashpai: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/14e95a20-0bb8-11ef-8bea-fa1b23220f91-0

openshift-ci[bot] commented 5 months ago

@simonpasquier: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/versions | f9670c7cbddc3595362c6e8e175a412f3aad706d | link | false | `/test versions` |

simonpasquier commented 5 months ago

/hold

slashpai commented 5 months ago

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node https://github.com/openshift/api/pull/1865

openshift-ci[bot] commented 5 months ago

@slashpai: it appears that you have attempted to use some version of the payload command, but your comment was incorrectly formatted and cannot be acted upon. See the docs for usage info.

machine424 commented 5 months ago

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

machine424 commented 5 months ago

(even though we already got a green here https://github.com/openshift/cluster-monitoring-operator/pull/2337#issuecomment-2096201375 and no changes were pushed later. The test is failing on https://github.com/openshift/cluster-monitoring-operator/pull/2337#issuecomment-2096222403 because of unrelated etcd events)

simonpasquier commented 5 months ago

/skip

simonpasquier commented 5 months ago

/skip

juzhao commented 5 months ago

Tested with a cluster launched from the PR (launch 4.16.0-0.nightly-2024-05-07-025557,openshift/cluster-monitoring-operator#2337 aws,single-node): the readinessProbe path changed from /readyz to /livez and a startupProbe was added.

$ oc -n openshift-monitoring get pod metrics-server-5cc4cd5f75-5nshz -oyaml
...
    livenessProbe:
      failureThreshold: 3
      httpGet:
        path: /livez
        port: https
        scheme: HTTPS
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 1
    name: metrics-server
    ports:
    - containerPort: 10250
      name: https
      protocol: TCP
    readinessProbe:
      failureThreshold: 6
      httpGet:
        path: /livez
        port: https
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1
    resources:
      requests:
        cpu: 1m
        memory: 40Mi
    securityContext:
      allowPrivilegeEscalation: false
      capabilities:
        drop:
        - ALL
      readOnlyRootFilesystem: true
      runAsNonRoot: true
      runAsUser: 1000450000
    startupProbe:
      failureThreshold: 6
      httpGet:
        path: /readyz
        port: https
        scheme: HTTPS
      initialDelaySeconds: 20
      periodSeconds: 20
      successThreshold: 1
      timeoutSeconds: 1
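
For reference, the startupProbe values above bound how long the container has for /readyz to succeed. A back-of-the-envelope sketch, assuming the common approximation of initialDelaySeconds + failureThreshold × periodSeconds and ignoring timeoutSeconds (the helper name is hypothetical):

```python
def probe_window_seconds(initial_delay, period, failure_threshold):
    """Approximate upper bound, in seconds, before the probe is declared failed."""
    return initial_delay + period * failure_threshold


# Values copied from the startupProbe in the pod spec above.
startup_probe = {"initialDelaySeconds": 20, "periodSeconds": 20, "failureThreshold": 6}

budget = probe_window_seconds(
    startup_probe["initialDelaySeconds"],
    startup_probe["periodSeconds"],
    startup_probe["failureThreshold"],
)
print(budget)  # 140
```

Under that approximation, the metrics-server has roughly 140 seconds to pass /readyz once at startup, after which the readiness probe only checks /livez.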

/label qe-approved

machine424 commented 5 months ago

/lgtm

openshift-ci[bot] commented 5 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: machine424, simonpasquier

Needs approval from an approver in each of these files:
- ~~[OWNERS](https://github.com/openshift/cluster-monitoring-operator/blob/master/OWNERS)~~ [machine424,simonpasquier]

Approvers can indicate their approval by writing `/approve` in a comment
Approvers can cancel approval by writing `/approve cancel` in a comment

slashpai commented 5 months ago

/payload-job-with-prs periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node openshift/api#1865

openshift-ci[bot] commented 5 months ago

@slashpai: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-nightly-4.16-e2e-aws-ovn-single-node

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/5a5eb190-1117-11ef-93e5-df91624bf14d-0

simonpasquier commented 5 months ago

/hold cancel

openshift-ci-robot commented 5 months ago

@simonpasquier: Jira Issue OCPBUGS-32510: Some pull requests linked via external trackers have merged:

The following pull requests linked via external trackers have not merged:

openshift/cluster-monitoring-operator#2329 is open

These pull requests must merge or be unlinked from the Jira bug in order for it to move to the next state. Once unlinked, request a bug refresh with /jira refresh.

Jira Issue OCPBUGS-32510 has not been moved to the MODIFIED state.

openshift-bot commented 5 months ago

[ART PR BUILD NOTIFIER]

This PR has been included in build cluster-monitoring-operator-container-v4.17.0-202405132002.p0.g86b6d4b.assembly.stream.el9 for distgit cluster-monitoring-operator. All builds following this will include this PR.
