Closed rexagod closed 3 months ago
@rexagod: This pull request references MON-3752 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rexagod
The full list of commands accepted by this bot can be found here.
The pull request process is described here
@rexagod: This pull request references MON-3752 which is a valid jira issue.
Warning: The referenced jira issue has an invalid target version for the target branch this PR targets: expected the task to target the "4.16.0" version, but no target version was set.
KSM Logs
``` I0312 21:31:50.753546 1 wrapper.go:120] "Starting kube-state-metrics" W0312 21:31:50.753856 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754060 1 server.go:192] "Used default resources" I0312 21:31:50.754100 1 types.go:184] "Using all namespaces" I0312 21:31:50.754164 1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$" W0312 21:31:50.754199 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754825 1 utils.go:70] "Tested communication with server" I0312 21:31:50.771593 1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64" I0312 21:31:50.771662 1 utils.go:76] "Communication with server successful" I0312 21:31:50.772149 1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081" I0312 21:31:50.772315 1 metrics_handler.go:99] "Autosharding disabled" I0312 21:31:50.774043 1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082" I0312 21:31:50.776631 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082 I0312 21:31:50.776699 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082 I0312 21:31:50.778155 1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments" I0312 21:31:50.778438 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081 I0312 21:31:50.778514 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081 ```
- [x] I added CHANGELOG entry for this change.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
/jira refresh
@rexagod: This pull request references MON-3752 which is a valid jira issue.
/test e2e-agnostic-operator
The description shows the current behavior, which is to replace the deny-list with the one is specified by the user (CMO configuration deny-list takes precedence over the default one).
However, owing to this, users could very well find themselves (a) denying certain default metrics that various alerts and dashboards depend upon, or (b) enabling certain default metrics that could cause cardinality issues, ^kube_.+_annotations$
, for instance.
Adding high cardinality metrics is a concern, but I'm more concerned about users removing metrics that we rely on in dashboards and alerts. The feature request around this is specifically to add previously denied metrics back. Giving users full reign over the denylist strikes me as quite dangerous, as they now easily render the stack mostly useless.
/cc @Tai-RedHat for PR review and test
@juzhao: GitHub didn't allow me to request PR reviews from the following users: for, PR, review, and, test.
Note that only openshift members and repo collaborators can review this PR, and authors cannot review their own PRs.
/test e2e-aws-ovn-single-node /test versions
Seem there is no erro logs feedback after I modified KSM deploy. before test:
% oc -n openshift-monitoring get deploy kube-state-metrics -ojsonpath='{.spec.template.spec.containers[?(@.name=="kube-state-metrics")].args}' |jq
[
"--host=127.0.0.1",
"--port=8081",
"--telemetry-host=127.0.0.1",
"--telemetry-port=8082",
"--metric-denylist=\n^kube_secret_labels$,\n^kube_.+_annotations$\n",
"--metric-labels-allowlist=pods=[*],nodes=[*],namespaces=[*],persistentvolumes=[*],persistentvolumeclaims=[*],poddisruptionbudgets=[*]",
"--metric-denylist=\n^kube_.+_created$,\n^kube_.+_metadata_resource_version$,\n^kube_replicaset_metadata_generation$,\n^kube_replicaset_status_observed_generation$,\n^kube_pod_restart_policy$,\n^kube_pod_init_container_status_terminated$,\n^kube_pod_init_container_status_running$,\n^kube_pod_container_status_terminated$,\n^kube_pod_container_status_running$,\n^kube_pod_completion_time$,\n^kube_pod_status_scheduled$\n"
]
then I applied cm
% oc apply -f -<<EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
kubeStateMetrics:
metricDenylist:
- 123
EOF
Then I check the deploy logs, no new updates.
% oc -n openshift-monitoring logs deploy/kube-state-metrics
I just found that there are error logs in CMO, is it ok to without any feedback to KSM deploy logs?
% oc -n openshift-monitoring logs deploy/cluster-monitoring-operator
...
W0418 09:11:52.319194 1 tasks.go:73] task 5 of 16: UpdatingKubeStateMetrics failed: verifying kube-state-metrics deny-list bounds failed: cannot deny a metric enabled by default: kube_pod_container_info{namespace="openshift-ingress-operator",pod="ingress-operator-8498c75fc5-4lr6s",uid="d640df5f-78b9-44e0-af99-17b0a20123f5",container="ingress-operator",image_spec="registry.build03.ci.openshift.org/ci-ln-rw7l0fb/stable@sha256:7b49677e853709d968b864c880993381d9fb2209d7cfe4205e0abbd540598e8d",image="registry.build03.ci.openshift.org/ci-ln-rw7l0fb/stable@sha256:7b49677e853709d968b864c880993381d9fb2209d7cfe4205e0abbd540598e8d",image_id="registry.build03.ci.openshift.org/ci-ln-rw7l0fb/stable@sha256:7b49677e853709d968b864c880993381d9fb2209d7cfe4205e0abbd540598e8d",container_id="cri-o://71f501e07301ae1b9a1d6a8d67957f31cd31120e4907b8a5b229d6dcdc7efeb7"} 1
@Tai-RedHat PTAL at my comment here: https://github.com/openshift/cluster-monitoring-operator/pull/2283#discussion_r1571542539, which should help provide more context around this patch.
this PR needs to rebase
launch 4.16,openshift/cluster-monitoring-operator#2283 gcp
pull request openshift/cluster-monitoring-operator#2283 needs to be rebased to branch master
PR tested with cluster-bot /label qe-approved
BTW, I found that if I give an error config map like:
% oc apply -f -<<EOF
apiVersion: v1
kind: ConfigMap
metadata:
name: cluster-monitoring-config
namespace: openshift-monitoring
data:
config.yaml: |
kubeStateMetrics:
metricDenylist:
- 123
EOF
the operator will not degrade, is this expected?
% oc get co/monitoring
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
monitoring 4.16.0-0.test-2024-04-23-015105-ci-ln-fdn20ht-latest True False False 45m
I'm asking this because when I give a invalid value such as ^kube_lease_owner$
, the operator will degrated
% oc get co/monitoring
NAME VERSION AVAILABLE PROGRESSING DEGRADED SINCE MESSAGE
monitoring 4.16.0-0.test-2024-04-23-015105-ci-ln-fdn20ht-latest False True True 29s UpdatingPeripherals: verifying kube-state-metrics deny-list bounds failed: cannot deny a metric enabled by default: kube_lease_owner
@rexagod: This pull request references MON-3752 which is a valid jira issue.
KSM Logs
``` I0312 21:31:50.753546 1 wrapper.go:120] "Starting kube-state-metrics" W0312 21:31:50.753856 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754060 1 server.go:192] "Used default resources" I0312 21:31:50.754100 1 types.go:184] "Using all namespaces" I0312 21:31:50.754164 1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$" W0312 21:31:50.754199 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754825 1 utils.go:70] "Tested communication with server" I0312 21:31:50.771593 1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64" I0312 21:31:50.771662 1 utils.go:76] "Communication with server successful" I0312 21:31:50.772149 1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081" I0312 21:31:50.772315 1 metrics_handler.go:99] "Autosharding disabled" I0312 21:31:50.774043 1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082" I0312 21:31:50.776631 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082 I0312 21:31:50.776699 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082 I0312 21:31:50.778155 1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments" I0312 21:31:50.778438 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081 I0312 21:31:50.778514 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081 ```
- [x] I added CHANGELOG entry for this change.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.
PR tested with cluster-bot /label qe-approved
BTW, I found that if I give an error config map like:
% oc apply -f -<<EOF apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | kubeStateMetrics: metricDenylist: - 123 EOF
does your cluster have 123
metrics which exposed by KSM? if no, it's expected monitoring is not degraded
BTW, I found that if I give an error config map like: % oc apply -f -<<EOF apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | kubeStateMetrics: metricDenylist:
- 123 EOF the operator will not degrade, is this expected?
Yep, it is, since 123 is not within the set of metrics enabled by default, so its okay to disable that.
I'm asking this because when I give a invalid value such as ^kube_lease_owner$, the operator will degrated.
Unlike 123, kube_lease_owner
is a metric that will be enabled by default, so disabling that will cause the operator to become degraded.
/retest-required
Cluster bot is taking longer than expected, so I've pushed the commit here to let the CI save us some time. I'll check locally too once I have one.
Squashed all fixup!
s.
/retest-required
/label docs-approved
/retest
re-test PR with cluster-bot, LGTM
/hold
Until ongoing discussions are done with.
PR needs rebase.
@rexagod: The following tests failed, say /retest
to rerun all failed tests or /retest-required
to rerun all mandatory failed tests:
Test name | Commit | Details | Required | Rerun command |
---|---|---|---|---|
ci/prow/go-fmt | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test go-fmt |
ci/prow/shellcheck | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test shellcheck |
ci/prow/e2e-aws-ovn-techpreview | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test e2e-aws-ovn-techpreview |
ci/prow/unit | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test unit |
ci/prow/verify | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test verify |
ci/prow/rules | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test rules |
ci/prow/versions | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | false | /test versions |
ci/prow/e2e-aws-ovn | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test e2e-aws-ovn |
ci/prow/images | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test images |
ci/prow/jsonnet-fmt | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test jsonnet-fmt |
ci/prow/golangci-lint | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test golangci-lint |
ci/prow/vendor | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test vendor |
ci/prow/e2e-aws-ovn-single-node | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | false | /test e2e-aws-ovn-single-node |
ci/prow/e2e-agnostic-operator | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test e2e-agnostic-operator |
ci/prow/generate | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test generate |
ci/prow/e2e-aws-ovn-upgrade | e3ae58aa2a0ff9cabd697207f90142551629dff0 | link | true | /test e2e-aws-ovn-upgrade |
Full PR test history. Your PR dashboard.
Signed-off-by: Pranshu Srivastava rexagod@gmail.com
CMO Config (with denied metrics)
```yaml apiVersion: v1 kind: ConfigMap metadata: name: cluster-monitoring-config namespace: openshift-monitoring data: config.yaml: | kubeStateMetrics: metricDenylist: - ^kube_.+_created$ - ^kube_.+_annotations$ ```KSM Logs
``` I0312 21:31:50.753546 1 wrapper.go:120] "Starting kube-state-metrics" W0312 21:31:50.753856 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754060 1 server.go:192] "Used default resources" I0312 21:31:50.754100 1 types.go:184] "Using all namespaces" I0312 21:31:50.754164 1 server.go:225] "Metric allow-denylisting" allowDenyStatus="Excluding the following lists that were on denylist: ^kube_.+_annotations$, ^kube_.+_created$" W0312 21:31:50.754199 1 client_config.go:618] Neither --kubeconfig nor --master was specified. Using the inClusterConfig. This might not work. I0312 21:31:50.754825 1 utils.go:70] "Tested communication with server" I0312 21:31:50.771593 1 utils.go:75] "Run with Kubernetes cluster version" major="1" minor="28" gitVersion="v1.28.7+6e2789b" gitTreeState="clean" gitCommit="dfd36a72760e09d4971f8b49ed335f5522dab8af" platform="linux/amd64" I0312 21:31:50.771662 1 utils.go:76] "Communication with server successful" I0312 21:31:50.772149 1 server.go:347] "Started metrics server" metricsServerAddress="127.0.0.1:8081" I0312 21:31:50.772315 1 metrics_handler.go:99] "Autosharding disabled" I0312 21:31:50.774043 1 server.go:336] "Started kube-state-metrics self metrics server" telemetryAddress="127.0.0.1:8082" I0312 21:31:50.776631 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8082 I0312 21:31:50.776699 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8082 I0312 21:31:50.778155 1 builder.go:271] "Active resources" activeStoreNames="certificatesigningrequests,configmaps,cronjobs,daemonsets,deployments,endpoints,horizontalpodautoscalers,ingresses,jobs,leases,limitranges,mutatingwebhookconfigurations,namespaces,networkpolicies,nodes,persistentvolumeclaims,persistentvolumes,poddisruptionbudgets,pods,replicasets,replicationcontrollers,resourcequotas,secrets,services,statefulsets,storageclasses,validatingwebhookconfigurations,volumeattachments" I0312 21:31:50.778438 1 server.go:72] levelinfomsgListening onaddress127.0.0.1:8081 I0312 21:31:50.778514 1 server.go:72] levelinfomsgTLS is disabled.http2falseaddress127.0.0.1:8081 ```