kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Duplicate tolerations causing issue with prometheus >= 2.52.0 #2390

Closed rgarcia89 closed 1 month ago

rgarcia89 commented 1 month ago

What happened: Starting with version 2.52.0, Prometheus detects duplicate series during scraping. As a result, Prometheus logs errors when it scrapes kube-state-metrics and a pod's tolerations array contains duplicate entries, because each duplicate entry produces an identical kube_pod_tolerations series.

Prometheus debug logs:

ts=2024-05-13T19:21:09.190Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Duplicate sample for timestamp" series="kube_pod_tolerations{namespace=\"calico-system\",pod=\"calico-kube-controllers-75c647b46c-pg9cr\",uid=\"bf944c52-17bd-438b-bbf1-d97f8671bd6b\",key=\"CriticalAddonsOnly\",operator=\"Exists\"}"
ts=2024-05-13T19:21:09.207Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/monitoring/kube-state-metrics/0 target=https://10.244.5.6:8443/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=1
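
To confirm which series trigger these log lines, the /metrics output of kube-state-metrics can be checked for repeated series before Prometheus scrapes it. The following is a minimal sketch, not part of either project: the helper names `series_id` and `find_duplicate_series` are hypothetical, the sample pod name is made up, and the parsing assumes label sets are rendered identically for identical series.

```python
from collections import Counter
from typing import Dict


def series_id(line: str) -> str:
    """Return the metric name plus label set, i.e. the sample line without its value."""
    if "{" in line:
        # Everything up to the last closing brace identifies the series;
        # only the numeric value (and optional timestamp) follows it.
        return line[: line.rindex("}") + 1]
    # No labels: the identity is just the metric name.
    return line.split()[0]


def find_duplicate_series(exposition: str) -> Dict[str, int]:
    """Map each series that appears more than once to its number of occurrences."""
    counts = Counter(
        series_id(line)
        for line in exposition.splitlines()
        if line.strip() and not line.startswith("#")  # skip blanks and HELP/TYPE comments
    )
    return {series: n for series, n in counts.items() if n > 1}


if __name__ == "__main__":
    # Hypothetical excerpt of a /metrics response for a pod with a repeated toleration.
    sample = (
        'kube_pod_tolerations{namespace="default",pod="test",key="CriticalAddonsOnly",operator="Exists"} 1\n'
        'kube_pod_tolerations{namespace="default",pod="test",key="CriticalAddonsOnly",operator="Exists"} 1\n'
    )
    for series, n in find_duplicate_series(sample).items():
        print(f"{n}x {series}")
```

Saving the /metrics response to a file and passing its contents to `find_duplicate_series` should list the same series that Prometheus reports as "Duplicate sample for timestamp".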

kube-state-metrics might need to deduplicate the toleration entries, or add an index to entries that have duplicates.
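
Both options mentioned above can be sketched as follows. This is an illustration only, written in Python rather than as a patch to kube-state-metrics: the function names are hypothetical, tolerations are modeled as plain dictionaries, and two entries are assumed to be duplicates when their key, operator, value, effect, and tolerationSeconds fields all match.

```python
from typing import Dict, List, Tuple

Toleration = Dict[str, object]

# Fields of a Kubernetes toleration that together define its identity (assumption of this sketch).
FIELDS = ("key", "operator", "value", "effect", "tolerationSeconds")


def toleration_identity(toleration: Toleration) -> tuple:
    """Build a hashable identity from the toleration fields; missing fields become None."""
    return tuple(toleration.get(field) for field in FIELDS)


def dedupe_tolerations(tolerations: List[Toleration]) -> List[Toleration]:
    """Option 1: keep only the first occurrence of each toleration, preserving order."""
    seen = set()
    unique = []
    for toleration in tolerations:
        identity = toleration_identity(toleration)
        if identity not in seen:
            seen.add(identity)
            unique.append(toleration)
    return unique


def index_tolerations(tolerations: List[Toleration]) -> List[Tuple[Toleration, int]]:
    """Option 2: keep every entry and pair it with its occurrence index (0 for the first copy)."""
    occurrences: Dict[tuple, int] = {}
    indexed = []
    for toleration in tolerations:
        identity = toleration_identity(toleration)
        index = occurrences.get(identity, 0)
        indexed.append((toleration, index))
        occurrences[identity] = index + 1
    return indexed
```

Deduplication keeps the existing label set unchanged, whereas an index would have to be exposed as an extra label, which changes the series for every consumer of the metric.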

How to reproduce it (as minimally and precisely as possible):

Create the following Deployment and look at the metrics produced by kube-state-metrics:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: test-deployment
  labels:
    app: something
spec:
  replicas: 1
  selector:
    matchLabels:
      app: test
  template:
    metadata:
      labels:
        app: test
    spec:
      containers:
      - name: test-container
        image: nginx
      tolerations:
      - key: CriticalAddonsOnly
        operator: Exists
      - key: CriticalAddonsOnly
        operator: Exists

Anything else we need to know?: I also opened an issue on the Prometheus project: https://github.com/prometheus/prometheus/issues/14089

dgrisonnet commented 1 month ago

/assign
/triage accepted

dgrisonnet commented 1 month ago

Quoting yourself from the issue you opened against Kubernetes:

A validation check within the Kubernetes API server to reject manifests with duplicate tolerations, ensuring adherence to Kubernetes best practices and avoiding potential issues related to duplicate toleration definitions would be great.

This is also what I would expect to be in the kube-apiserver. I don't think we should handle this scenario at kube-state-metrics' level since the object data is erroneous.
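
For illustration, the kind of validation check described above could look roughly like this. It is a sketch in Python, not the kube-apiserver's actual validation code: the function name `validate_tolerations`, the error message format, and the default field path are all assumptions.

```python
from typing import Dict, List

# Fields compared to decide whether two tolerations are identical (assumption of this sketch).
FIELDS = ("key", "operator", "value", "effect", "tolerationSeconds")


def validate_tolerations(
    tolerations: List[Dict[str, object]],
    path: str = "spec.template.spec.tolerations",
) -> List[str]:
    """Return one error message per toleration that repeats an earlier entry."""
    first_seen: Dict[tuple, int] = {}
    errors = []
    for index, toleration in enumerate(tolerations):
        identity = tuple(toleration.get(field) for field in FIELDS)
        if identity in first_seen:
            # Point at both the duplicate and the entry it repeats.
            errors.append(f"{path}[{index}]: duplicate of {path}[{first_seen[identity]}]")
        else:
            first_seen[identity] = index
    return errors
```

A manifest would be rejected when the returned list is non-empty; for the Deployment in the report, the second CriticalAddonsOnly entry would be flagged.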

I am closing this issue in favor of the Kubernetes one. Feel free to reopen if the Kubernetes maintainers think we should handle this scenario here.