openshift / origin

Conformance test suite for OpenShift
http://www.openshift.org
Apache License 2.0

NO-JIRA: fix: increase watch count for monitoring operator #29434

Open eggfoobar opened 3 weeks ago

eggfoobar commented 3 weeks ago

Increasing the watch count limit for the monitoring operator to 110% of the p95 for SNO 4.19 runs

(P95 * 1.1)/2 =~ 101
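
The arithmetic above can be sketched in Python. This is a minimal illustration, not code from the PR: the helper name `watch_limit`, the choice to round up, and the example p95 of 183 are assumptions picked so that the result matches the ≈101 figure. The p95 actually returned by the query is not stated in this thread, and the division by 2 is taken from the formula as written.

```python
def watch_limit(p95: int, headroom_percent: int = 110, divisor: int = 2) -> int:
    """Return ceil(p95 * headroom_percent / 100 / divisor) using integer math.

    Integer arithmetic avoids floating-point surprises such as
    200 * 1.1 evaluating to slightly more than 220.
    """
    numerator = p95 * headroom_percent
    denominator = 100 * divisor
    # Ceiling division for non-negative integers.
    return -(-numerator // denominator)


# Hypothetical p95; the real value comes from the BigQuery result.
example_p95 = 183
limit = watch_limit(example_p95)
print(limit)
```

Rounding up rather than down is a conservative choice here: it leaves slightly more slack in the limit instead of slightly less.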

```
SELECT
  APPROX_QUANTILES(WatchRequestCount, 100)[OFFSET(50)] AS Median,
  APPROX_QUANTILES(WatchRequestCount, 100)[OFFSET(95)] AS Percentile_95,
  APPROX_QUANTILES(WatchRequestCount, 100)[OFFSET(99)] AS Percentile_99
FROM `openshift-ci-data-analysis.ci_data_autodl.operator_watch_requests` AS Watches
  INNER JOIN openshift-gce-devel.ci_analysis_us.jobs AS JobRuns
    ON JobRuns.prowjob_build_id = Watches.JobRunName
  INNER JOIN openshift-ci-data-analysis.ci_data.JobsWithVariants AS Jobs
    ON Jobs.JobName = JobRuns.prowjob_job_name
WHERE
  JobRuns.prowjob_job_name LIKE "%4.19%"
  AND Watches.Operator = "cluster-monitoring-operator"
  AND Watches.ControlPlaneTopology = "SingleReplica"
  AND Watches.PlatformType = "AWS"
```

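
For readers unfamiliar with the `APPROX_QUANTILES(..., 100)[OFFSET(n)]` expressions in the query: BigQuery returns an array of 101 boundaries (the approximate minimum, 99 interior percentiles, and the approximate maximum), so offsets 50, 95, and 99 select the approximate median, p95, and p99. A rough exact-sort analogue in Python follows. The helper name `quantile_boundaries` and the input counts are hypothetical and synthetic, not CI data, and BigQuery's approximation may return slightly different values.

```python
def quantile_boundaries(values, buckets=100):
    """Return buckets + 1 boundaries, from minimum to maximum, via an exact sort.

    This only mirrors the shape of the APPROX_QUANTILES result; it picks the
    nearest sorted index for each boundary instead of approximating.
    """
    ordered = sorted(values)
    if not ordered:
        raise ValueError("values must not be empty")
    last = len(ordered) - 1
    return [ordered[round(i * last / buckets)] for i in range(buckets + 1)]


# Synthetic watch counts (1..201), used purely for illustration.
counts = range(1, 202)
boundaries = quantile_boundaries(counts)

# Same offsets as the query: median, p95, p99.
median, p95, p99 = boundaries[50], boundaries[95], boundaries[99]
print(median, p95, p99)
```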
openshift-ci-robot commented 3 weeks ago

@eggfoobar: This pull request explicitly references no jira issue.

Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Forigin). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.

openshift-ci[bot] commented 3 weeks ago

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: eggfoobar

Once this PR has been reviewed and has the lgtm label, please assign bertinatto for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

- **[OWNERS](https://github.com/openshift/origin/blob/master/OWNERS)**

Approvers can indicate their approval by writing `/approve` in a comment.
Approvers can cancel approval by writing `/approve cancel` in a comment.

eggfoobar commented 3 weeks ago

/hold

Hey @sosiouxme, even though the increase is small, it's odd, since HA runs seem to have stayed the same. I could not identify any major changes in the monitoring operator, but this all started with the new kubelet version. I will take a quick look, but feel free to unblock if you're okay with this change.

openshift-ci[bot] commented 3 weeks ago

@eggfoobar: The following tests failed. Say `/retest` to rerun all failed tests or `/retest-required` to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e-openstack-ovn | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-openstack-ovn` |
| ci/prow/e2e-gcp-csi | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-gcp-csi` |
| ci/prow/e2e-gcp-ovn | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | true | `/test e2e-gcp-ovn` |
| ci/prow/e2e-aws-ovn-kube-apiserver-rollout | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-aws-ovn-kube-apiserver-rollout` |
| ci/prow/e2e-aws-ovn-fips | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | true | `/test e2e-aws-ovn-fips` |
| ci/prow/e2e-metal-ipi-ovn-ipv6 | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | true | `/test e2e-metal-ipi-ovn-ipv6` |
| ci/prow/e2e-aws-ovn-edge-zones | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | true | `/test e2e-aws-ovn-edge-zones` |
| ci/prow/e2e-aws-ovn-serial | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | true | `/test e2e-aws-ovn-serial` |
| ci/prow/e2e-aws-csi | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-aws-csi` |
| ci/prow/okd-scos-e2e-aws-ovn | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test okd-scos-e2e-aws-ovn` |
| ci/prow/e2e-gcp-ovn-upgrade | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | true | `/test e2e-gcp-ovn-upgrade` |
| ci/prow/e2e-metal-ipi-ovn-kube-apiserver-rollout | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-metal-ipi-ovn-kube-apiserver-rollout` |
| ci/prow/e2e-aws-ovn-single-node-serial | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-aws-ovn-single-node-serial` |
| ci/prow/e2e-agnostic-ovn-cmd | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-agnostic-ovn-cmd` |
| ci/prow/e2e-aws-ovn-cgroupsv2 | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-aws-ovn-cgroupsv2` |
| ci/prow/e2e-metal-ipi-ovn | 861098f16b5f317d7dd3d97254969f9d6f82d926 | link | false | `/test e2e-metal-ipi-ovn` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).
deads2k commented 3 weeks ago

What monitoring operator code change led to this versus the known etcd quorum stability problems?

eggfoobar commented 3 weeks ago

> What monitoring operator code change led to this versus the known etcd quorum stability problems?

At this point, I don't think it has anything to do with the monitoring operator; at least, I couldn't identify any change that would cause an added watch count. I wasn't aware of a quorum issue. This PR is just a quick, ready-to-go update, since I found it odd that only the monitoring operator was failing here.