openshift / cluster-monitoring-operator

Manage the OpenShift monitoring stack
Apache License 2.0
247 stars 363 forks source link

MON-3752: expose metric-denylist for KSM #2283

Closed rexagod closed 3 months ago

rexagod commented 7 months ago

Signed-off-by: Pranshu Srivastava rexagod@gmail.com

CMO Config (with denied metrics) ```yaml apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | kubeStateMetrics: metricDenylist: - ^kube_.+_created$ - ^kube_.+_annotations$ ```
KSM Logs ``` I0312 21:31:50.753546 1 wrapper.go:120] "Starting kube-state-metrics" W0312 21:31:50.753856 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754060 1 server.go:192] "Used default resources" I0312 21:31:50.754100 1 types.go:184] "Using all namespaces" I0312 21:31:50.754164 1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$" W0312 21:31:50.754199 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754825 1 utils.go:70] "Tested communication with server" I0312 21:31:50.771593 1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64" I0312 21:31:50.771662 1 utils.go:76] "Communication with server successful" I0312 21:31:50.772149 1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081" I0312 21:31:50.772315 1 metrics_handler.go:99] "Autosharding disabled" I0312 21:31:50.774043 1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082" I0312 21:31:50.776631 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082 I0312 21:31:50.776699 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082 I0312 21:31:50.778155 1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments" I0312 21:31:50.778438 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081 I0312 21:31:50.778514 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081 ```

openshift-ci-robot commented 7 months ago

@rexagod: This pull request references MON-3752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2283): >_Signed-off-by: Pranshu Srivastava _ > >*** > >* [x] I added CHANGELOG entry for this change. Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
openshift-ci[bot] commented 7 months ago

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rexagod

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files: - ~~[OWNERS](https://github.com/openshift/cluster-monitoring-operator/blob/master/OWNERS)~~ [rexagod] Approvers can indicate their approval by writing `/approve` in a comment Approvers can cancel approval by writing `/approve cancel` in a comment
openshift-ci-robot commented 7 months ago

@rexagod: This pull request references MON-3752 which is a valid jira issue.

Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2283): >_Signed-off-by: Pranshu Srivastava _ > >
>CMO Config (with denied metrics) > >```yaml >apiVersion: v1 >kind: ConfigMap >metadata: > name: cluster-monitoring-config > namespace: openshift-monitoring >data: > config.yaml: | > kubeStateMetrics: > metricDenylist: > - ^kube_.+_created$ > - ^kube_.+_annotations$ >``` > >
KSM Logs ``` I0312 21:31:50.753546 1 wrapper.go:120] "Starting kube-state-metrics" W0312 21:31:50.753856 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754060 1 server.go:192] "Used default resources" I0312 21:31:50.754100 1 types.go:184] "Using all namespaces" I0312 21:31:50.754164 1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$" W0312 21:31:50.754199 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754825 1 utils.go:70] "Tested communication with server" I0312 21:31:50.771593 1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64" I0312 21:31:50.771662 1 utils.go:76] "Communication with server successful" I0312 21:31:50.772149 1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081" I0312 21:31:50.772315 1 metrics_handler.go:99] "Autosharding disabled" I0312 21:31:50.774043 1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082" I0312 21:31:50.776631 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082 I0312 21:31:50.776699 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082 I0312 21:31:50.778155 1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments" I0312 21:31:50.778438 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081 I0312 21:31:50.778514 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081 ```

  • [x] I added CHANGELOG entry for this change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

rexagod commented 7 months ago

/jira refresh

openshift-ci-robot commented 7 months ago

@rexagod: This pull request references MON-3752 which is a valid jira issue.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2283#issuecomment-1992637352): >/jira refresh Instructions for interacting with me using PR comments are available [here](https://prow.ci.openshift.org/command-help?repo=openshift%2Fcluster-monitoring-operator). If you have questions or suggestions related to my behavior, please file an issue against the [openshift-eng/jira-lifecycle-plugin](https://github.com/openshift-eng/jira-lifecycle-plugin/issues/new) repository.
rexagod commented 7 months ago

/test e2e-agnostic-operator

rexagod commented 7 months ago

The description shows the current behavior, which is to replace the deny-list with the one is specified by the user (CMO configuration deny-list takes precedence over the default one).

However, owing to this, users could very well find themselves (a) denying certain default metrics that various alerts and dashboards depend upon, or (b) enabling certain default metrics that could cause cardinality issues, ^kube_.+_annotations$, for instance.

jan--f commented 7 months ago

Adding high cardinality metrics is a concern, but I'm more concerned about users removing metrics that we rely on in dashboards and alerts. The feature request around this is specifically to add previously denied metrics back. Giving users full reign over the denylist strikes me as quite dangerous, as they now easily render the stack mostly useless.

juzhao commented 6 months ago

/cc @Tai-RedHat for PR review and test

openshift-ci[bot] commented 6 months ago

@juzhao: GitHub didn't allow me to request PR reviews from the following users: for, PR, review, and, test.

Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2283#issuecomment-2062911157): >/cc @Tai-RedHat for PR review and test Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
juzhao commented 6 months ago

/test e2e-aws-ovn-single-node /test versions

Tai-RedHat commented 6 months ago

Seem there is no erro logs feedback after I modified KSM deploy. before test:

% oc -n openshift-monitoring get deploy kube-state-metrics -ojsonpath='{.spec.template.spec.containers[?(@.name=="kube-state-metrics")].args}' |jq
[
  "--host=127.0.0.1",
  "--port=8081",
  "--telemetry-host=127.0.0.1",
  "--telemetry-port=8082",
  "--metric-denylist=\n^kube_secret_labels$,\n^kube_.+_annotations$\n",
  "--metric-labels-allowlist=pods=[*],nodes=[*],namespaces=[*],persistentvolumes=[*],persistentvolumeclaims=[*],poddisruptionbudgets=[*]",
  "--metric-denylist=\n^kube_.+_created$,\n^kube_.+_metadata_resource_version$,\n^kube_replicaset_metadata_generation$,\n^kube_replicaset_status_observed_generation$,\n^kube_pod_restart_policy$,\n^kube_pod_init_container_status_terminated$,\n^kube_pod_init_container_status_running$,\n^kube_pod_container_status_terminated$,\n^kube_pod_container_status_running$,\n^kube_pod_completion_time$,\n^kube_pod_status_scheduled$\n"
]

then I applied cm

% oc apply -f -<<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    kubeStateMetrics:
      metricDenylist:
        - 123
EOF

Then I check the deploy logs, no new updates. % oc -n openshift-monitoring logs deploy/kube-state-metrics

Tai-RedHat commented 6 months ago

I just found that there are error logs in CMO, is it ok to without any feedback to KSM deploy logs?

% oc -n openshift-monitoring logs deploy/cluster-monitoring-operator
...
W0418 09:11:52.319194       1 tasks.go:73] task 5 of 16: UpdatingKubeStateMetrics failed: verifying kube-state-metrics deny-list bounds failed: cannot deny a metric enabled by default: kube_pod_container_info{namespace="openshift-ingress-operator",pod="ingress-operator-8498c75fc5-4lr6s",uid="d640df5f-78b9-44e0-af99-17b0a20123f5",container="ingress-operator",image_spec="registry.build03.ci.openshift.org/ci-ln-rw7l0fb/stable@sha256:7b49677e853709d968b864c880993381d9fb2209d7cfe4205e0abbd540598e8d",image="registry.build03.ci.openshift.org/ci-ln-rw7l0fb/stable@sha256:7b49677e853709d968b864c880993381d9fb2209d7cfe4205e0abbd540598e8d",image_id="registry.build03.ci.openshift.org/ci-ln-rw7l0fb/stable@sha256:7b49677e853709d968b864c880993381d9fb2209d7cfe4205e0abbd540598e8d",container_id="cri-o://71f501e07301ae1b9a1d6a8d67957f31cd31120e4907b8a5b229d6dcdc7efeb7"} 1
rexagod commented 6 months ago

@Tai-RedHat PTAL at my comment here: https://github.com/openshift/cluster-monitoring-operator/pull/2283#discussion_r1571542539, which should help provide more context around this patch.

Tai-RedHat commented 6 months ago

this PR needs to rebase

launch 4.16,openshift/cluster-monitoring-operator#2283 gcp
pull request openshift/cluster-monitoring-operator#2283 needs to be rebased to branch master
Tai-RedHat commented 6 months ago

PR tested with cluster-bot /label qe-approved

BTW, I found that if I give an error config map like:

% oc apply -f -<<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    kubeStateMetrics:
      metricDenylist:
        - 123
EOF

the operator will not degrade, is this expected?

% oc get co/monitoring
NAME         VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.16.0-0.test-2024-04-23-015105-ci-ln-fdn20ht-latest   True        False         False      45m

I'm asking this because when I give a invalid value such as ^kube_lease_owner$, the operator will degrated

% oc get co/monitoring
NAME         VERSION                                                AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.16.0-0.test-2024-04-23-015105-ci-ln-fdn20ht-latest   False       True          True       29s     UpdatingPeripherals: verifying kube-state-metrics deny-list bounds failed: cannot deny a metric enabled by default: kube_lease_owner
openshift-ci-robot commented 6 months ago

@rexagod: This pull request references MON-3752 which is a valid jira issue.

In response to [this](https://github.com/openshift/cluster-monitoring-operator/pull/2283): >_Signed-off-by: Pranshu Srivastava _ > >
>CMO Config (with denied metrics) > >```yaml >apiVersion: v1 >kind: ConfigMap >metadata: > name: cluster-monitoring-config > namespace: openshift-monitoring >data: > config.yaml: | > kubeStateMetrics: > metricDenylist: > - ^kube_.+_created$ > - ^kube_.+_annotations$ >``` > >
KSM Logs ``` I0312 21:31:50.753546 1 wrapper.go:120] "Starting kube-state-metrics" W0312 21:31:50.753856 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754060 1 server.go:192] "Used default resources" I0312 21:31:50.754100 1 types.go:184] "Using all namespaces" I0312 21:31:50.754164 1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$" W0312 21:31:50.754199 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754825 1 utils.go:70] "Tested communication with server" I0312 21:31:50.771593 1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64" I0312 21:31:50.771662 1 utils.go:76] "Communication with server successful" I0312 21:31:50.772149 1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081" I0312 21:31:50.772315 1 metrics_handler.go:99] "Autosharding disabled" I0312 21:31:50.774043 1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082" I0312 21:31:50.776631 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082 I0312 21:31:50.776699 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082 I0312 21:31:50.778155 1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments" I0312 21:31:50.778438 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081 I0312 21:31:50.778514 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081 ```

  • [x] I added CHANGELOG entry for this change.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

juzhao commented 6 months ago

PR tested with cluster-bot /label qe-approved

BTW, I found that if I give an error config map like:


% oc apply -f -<<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    kubeStateMetrics:
      metricDenylist:
        - 123
EOF

does your cluster have 123 metrics which exposed by KSM? if no, it's expected monitoring is not degraded

rexagod commented 6 months ago

BTW, I found that if I give an error config map like: % oc apply -f -<<EOF apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | kubeStateMetrics: metricDenylist:

  • 123 EOF the operator will not degrade, is this expected?

Yep, it is, since 123 is not within the set of metrics enabled by default, so its okay to disable that.

I'm asking this because when I give a invalid value such as ^kube_lease_owner$, the operator will degrated.

Unlike 123, kube_lease_owner is a metric that will be enabled by default, so disabling that will cause the operator to become degraded.

rexagod commented 6 months ago

/retest-required

rexagod commented 6 months ago

Cluster bot is taking longer than expected, so I've pushed the commit here to let the CI save us some time. I'll check locally too once I have one.

rexagod commented 6 months ago

Squashed all fixup!s.

rexagod commented 5 months ago

/retest-required

eromanova97 commented 5 months ago

/label docs-approved

jan--f commented 5 months ago

/retest

Tai-RedHat commented 5 months ago

re-test PR with cluster-bot, LGTM

rexagod commented 5 months ago

/hold

Until ongoing discussions are done with.

openshift-merge-robot commented 5 months ago

PR needs rebase.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes/test-infra](https://github.com/kubernetes/test-infra/issues/new?title=Prow%20issue:) repository.
openshift-ci[bot] commented 5 months ago

@rexagod: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/go-fmt e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test go-fmt
ci/prow/shellcheck e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test shellcheck
ci/prow/e2e-aws-ovn-techpreview e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test e2e-aws-ovn-techpreview
ci/prow/unit e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test unit
ci/prow/verify e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test verify
ci/prow/rules e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test rules
ci/prow/versions e3ae58aa2a0ff9cabd697207f90142551629dff0 link false /test versions
ci/prow/e2e-aws-ovn e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test e2e-aws-ovn
ci/prow/images e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test images
ci/prow/jsonnet-fmt e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test jsonnet-fmt
ci/prow/golangci-lint e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test golangci-lint
ci/prow/vendor e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test vendor
ci/prow/e2e-aws-ovn-single-node e3ae58aa2a0ff9cabd697207f90142551629dff0 link false /test e2e-aws-ovn-single-node
ci/prow/e2e-agnostic-operator e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test e2e-agnostic-operator
ci/prow/generate e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test generate
ci/prow/e2e-aws-ovn-upgrade e3ae58aa2a0ff9cabd697207f90142551629dff0 link true /test e2e-aws-ovn-upgrade

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository. I understand the commands that are listed [here](https://go.k8s.io/bot-commands).