kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Duplicate tolerations causing issue with prometheus >= 2.52.0 #2390

Open rgarcia89 opened 6 months ago

rgarcia89 commented 6 months ago

What happened: Starting with version 2.52.0, Prometheus introduced a mechanism to detect duplicate series during scraping. This can lead to error logs when Prometheus scrapes kube-state-metrics and a pod (for example, one created by a Deployment) has duplicate entries in its tolerations array, because the duplicates are exposed as identical kube_pod_tolerations series.

Prometheus debug logs:

ts=2024-05-13T19:21:09.190Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"calico-system\",pod=\"calico-kube-controllers-75c647b46c-pg9cr\",uid=\"bf944c52-17bd-438b-bbf1-d97f8671bd6b\",key=\"CriticalAddonsOnly\",operator=\"Exists\"}"
ts=2024-05-13T19:21:09.207Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1

There might be a need to deduplicate the toleration entries or add an index to entries with existing duplicates.
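To illustrate why the duplicates collide, here is a minimal Go sketch. It is not kube-state-metrics code: the `Toleration` type, `seriesKey`, and `duplicateSeries` are hypothetical helpers, and the label names are taken from the log line above (the `uid` label is left out for brevity). It assumes one sample is emitted per toleration entry, so two identical entries yield the same series identity.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
)

// Toleration is a hypothetical stand-in for the Kubernetes toleration type,
// reduced to the two fields visible in the log line above.
type Toleration struct {
	Key      string
	Operator string
}

// seriesKey renders a metric name plus labels into a canonical string.
// Label names are sorted so that equal label sets always give equal keys.
func seriesKey(name string, labels map[string]string) string {
	names := make([]string, 0, len(labels))
	for n := range labels {
		names = append(names, n)
	}
	sort.Strings(names)

	parts := make([]string, 0, len(names))
	for _, n := range names {
		parts = append(parts, fmt.Sprintf("%s=%q", n, labels[n]))
	}
	return name + "{" + strings.Join(parts, ",") + "}"
}

// duplicateSeries returns every series key that would be emitted more than
// once if each toleration produced one sample for the given pod.
// Each duplicated key is reported a single time.
func duplicateSeries(namespace, pod string, tolerations []Toleration) []string {
	seen := map[string]int{}
	var dups []string
	for _, t := range tolerations {
		k := seriesKey("kube_pod_tolerations", map[string]string{
			"namespace": namespace,
			"pod":       pod,
			"key":       t.Key,
			"operator":  t.Operator,
		})
		seen[k]++
		if seen[k] == 2 {
			dups = append(dups, k)
		}
	}
	return dups
}

func main() {
	tolerations := []Toleration{
		{Key: "CriticalAddonsOnly", Operator: "Exists"},
		{Key: "CriticalAddonsOnly", Operator: "Exists"},
	}
	for _, k := range duplicateSeries("default", "test-pod", tolerations) {
		fmt.Println("duplicate series:", k)
	}
}
```

The second option mentioned above, adding an index label, would make each key distinct at the cost of an extra label on the metric.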

How to reproduce it (as minimally and precisely as possible):

Create the following Deployment and look at the metrics produced by kube-state-metrics:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: something
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test-container
        image: nginx
      tolerations:
       - key: CriticalAddonsOnly
         operator: Exists
       - key: CriticalAddonsOnly
         operator: Exists

Anything else we need to know?: I also opened an issue on the Prometheus project: https://github.com/prometheus/prometheus/issues/14089

Environment:

dgrisonnet commented 6 months ago

/assign
/triage accepted

dgrisonnet commented 6 months ago

Quoting yourself from the issue you opened against Kubernetes:

> A validation check within the Kubernetes API server to reject manifests with duplicate tolerations, ensuring adherence to Kubernetes best practices and avoiding potential issues related to duplicate toleration definitions would be great.

This is also what I would expect to be in the kube-apiserver. I don't think we should handle this scenario at kube-state-metrics' level since the object data is erroneous.

I am closing this issue in favor of the Kubernetes one. Feel free to reopen if the Kubernetes maintainers think we should handle this scenario here.

RiRa12621 commented 2 days ago

https://github.com/kubernetes/kubernetes/issues/124881#issuecomment-2491489140

seems this got bounced back here @dgrisonnet

k8s-ci-robot commented 2 days ago

@RiRa12621: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to [this](https://github.com/kubernetes/kube-state-metrics/issues/2390#issuecomment-2491505301):

>/reopen
>https://github.com/kubernetes/kubernetes/issues/124881#issuecomment-2491489140

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

dgrisonnet commented 2 days ago

Thanks for the heads up @RiRa12621 :)

/reopen

/unassign
/help

If anyone is interested in contributing the logic to make sure that there are only unique tolerations, feel free to self-assign the issue and draft a PR.

k8s-ci-robot commented 2 days ago

@dgrisonnet: Reopened this issue.

In response to [this](https://github.com/kubernetes/kube-state-metrics/issues/2390#issuecomment-2491545073):

>Thanks for the heads up @RiRa12621 :)
>
>/reopen
>
>/unassign
>/help
>
>If anyone is interested in contributing the logic to make sure that there are only unique tolerations, feel free to self-assign the issue and draft a PR.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.

RiRa12621 commented 2 days ago

/assign @RiRa12621

Not sure if this is the most elegant way, but it should do the job: https://github.com/kubernetes/kube-state-metrics/pull/2559

This takes all tolerations, keeps only the unique ones, and then applies the regular logic to those.
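For readers who want a feel for the approach, here is a rough Go sketch of order-preserving deduplication. It is not the code from the PR. The `Toleration` struct is a local stand-in that mirrors the field names of the Kubernetes core/v1 `Toleration` type (with plain strings instead of its typed strings), and `tolerationKey`, `keyOf`, and `uniqueTolerations` are hypothetical names. It assumes two tolerations count as duplicates only when all fields match, which may be stricter than what the metric labels actually need.

```go
package main

import "fmt"

// Toleration is a local stand-in mirroring the fields of the Kubernetes
// core/v1 Toleration type, so that the example has no external dependencies.
type Toleration struct {
	Key               string
	Operator          string
	Value             string
	Effect            string
	TolerationSeconds *int64
}

// tolerationKey is a comparable form of Toleration. The pointer field is
// flattened so that equal values behind different pointers compare equal.
type tolerationKey struct {
	key, operator, value, effect string
	hasSeconds                   bool
	seconds                      int64
}

// keyOf converts a Toleration into its comparable key.
func keyOf(t Toleration) tolerationKey {
	k := tolerationKey{key: t.Key, operator: t.Operator, value: t.Value, effect: t.Effect}
	if t.TolerationSeconds != nil {
		k.hasSeconds = true
		k.seconds = *t.TolerationSeconds
	}
	return k
}

// uniqueTolerations returns the input without repeated entries,
// keeping the first occurrence of each and preserving order.
func uniqueTolerations(in []Toleration) []Toleration {
	seen := make(map[tolerationKey]struct{}, len(in))
	out := make([]Toleration, 0, len(in))
	for _, t := range in {
		k := keyOf(t)
		if _, ok := seen[k]; ok {
			continue // already emitted an identical toleration
		}
		seen[k] = struct{}{}
		out = append(out, t)
	}
	return out
}

func main() {
	tolerations := []Toleration{
		{Key: "CriticalAddonsOnly", Operator: "Exists"},
		{Key: "CriticalAddonsOnly", Operator: "Exists"},
	}
	for _, t := range uniqueTolerations(tolerations) {
		fmt.Printf("key=%s operator=%s\n", t.Key, t.Operator)
	}
}
```

Keeping the first occurrence and the original order means the metric output stays stable for objects that have no duplicates.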