kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Duplicate samples for customResourceState metrics #2446

Closed: speer closed this issue 4 days ago

speer commented 3 months ago

What happened:

We upgraded to Prometheus 2.52 and started receiving the following warnings:

ts=2024-07-11T06:43:56.289Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/k8s-monitoring/kube-prometheus-stack-kube-state-metrics/0 target=http://x.x.x.x:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=23

We found similar open issues about duplicates, but this one is about all the metrics configured via customResourceState config.

After a fresh restart of the kube-state-metrics pod, the metrics are not duplicated. However, after a while, each of the metrics configured via customResourceState suddenly appears two or more times:

# There is exactly 1 resource of kind HelmRepository
$ kubectl get helmrepositories.source.toolkit.fluxcd.io -n flux-system
NAME     URL                     AGE
acraks   oci://xxxx.azurecr.io   8d

# After kube-state-metrics has been running for a while, it returns 3 identical metrics
$ curl http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep HelmRepository | grep flux-system
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1

# After a restart of kube-state-metrics, there are no duplications for a while
$ kubectl delete pod kube-prometheus-stack-kube-state-metrics-76968f786b-z7m8t
$ curl http://kube-prometheus-stack-kube-state-metrics.k8s-monitoring:8080/metrics | grep HelmRepository | grep flux-system
gotk_resource_info{customresource_group="source.toolkit.fluxcd.io",customresource_kind="HelmRepository",customresource_version="v1",exported_namespace="flux-system",name="acraks",url="oci://xxxx.azurecr.io"} 1
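
Spotting repeated series by eye with grep gets tedious once more than one resource is involved. The following is a minimal sketch, not part of kube-state-metrics, that counts identical series in a saved /metrics dump. The function name `find_duplicate_series` is made up for illustration, and the parsing assumes the plain Prometheus text format shown above (`name{labels} value`).

```python
from collections import Counter


def find_duplicate_series(exposition_text):
    """Return {series: count} for every series that appears more than once.

    A "series" here is the metric name plus its label set, i.e. everything
    before the sample value on a line of Prometheus text exposition output.
    """
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Cut at the last closing brace when labels are present, so spaces
        # inside label values are not mistaken for the name/value separator.
        brace = line.rfind("}")
        series = line[: brace + 1] if brace != -1 else line.split()[0]
        counts[series] += 1
    return {series: n for series, n in counts.items() if n > 1}
```

Given the first curl output above saved as text, this would report the single `gotk_resource_info` series for `acraks` with a count of 3, and an empty result for the output taken right after the restart.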

What you expected to happen:

No duplicates, as the resource exists just once and all labels are the same.

How to reproduce it (as minimally and precisely as possible):

Use the configuration provided here: https://fluxcd.io/flux/monitoring/custom-metrics or the customResourceState config below:

apiVersion: v1
data:
  config.yaml: |
    spec:
      resources:
      - groupVersionKind:
          group: source.toolkit.fluxcd.io
          kind: HelmRepository
          version: v1
        metricNamePrefix: gotk
        metrics:
        - each:
            info:
              labelsFromPath:
                name:
                - metadata
                - name
            type: Info
          help: The current state of a Flux HelmRepository resource.
          labelsFromPath:
            exported_namespace:
            - metadata
            - namespace
            ready:
            - status
            - conditions
            - '[type=Ready]'
            - status
            revision:
            - status
            - artifact
            - revision
            suspended:
            - spec
            - suspend
            url:
            - spec
            - url
          name: resource_info
kind: ConfigMap
metadata:
  labels:
    app.kubernetes.io/component: metrics
    app.kubernetes.io/instance: kube-prometheus-stack
    app.kubernetes.io/managed-by: Helm
    app.kubernetes.io/name: kube-state-metrics
    app.kubernetes.io/part-of: kube-state-metrics
    app.kubernetes.io/version: 2.12.0
    helm.sh/chart: kube-state-metrics-5.20.0
    helm.toolkit.fluxcd.io/name: kube-prometheus-stack
    helm.toolkit.fluxcd.io/namespace: flux-system
    release: kube-prometheus-stack
  name: kube-prometheus-stack-kube-state-metrics-customresourcestate-config
  namespace: k8s-monitoring
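
For readers unfamiliar with `labelsFromPath`, here is a rough sketch of how the paths in the config above map onto a resource. It is an illustration only, not kube-state-metrics' implementation: `resolve_path` and the trimmed-down `helm_repository` object are hypothetical, and dropping labels whose path does not resolve is an assumption chosen to match the metric output shown earlier, which carries `exported_namespace` and `url` but no `ready`, `revision` or `suspended` label.

```python
def resolve_path(obj, path):
    """Walk `path` through nested dicts/lists; return None if a step is missing."""
    current = obj
    for step in path:
        if isinstance(current, list) and step.startswith("[") and step.endswith("]"):
            # "[key=value]" selects the first list element whose `key` equals `value`.
            key, _, wanted = step[1:-1].partition("=")
            current = next(
                (item for item in current
                 if isinstance(item, dict) and str(item.get(key)) == wanted),
                None,
            )
        elif isinstance(current, dict):
            current = current.get(step)
        else:
            return None
        if current is None:
            return None
    return current


# Hypothetical, trimmed-down HelmRepository object (no status block).
helm_repository = {
    "metadata": {"name": "acraks", "namespace": "flux-system"},
    "spec": {"url": "oci://xxxx.azurecr.io"},
}

# The metric-level labelsFromPath entries from the config above.
labels_from_path = {
    "exported_namespace": ["metadata", "namespace"],
    "ready": ["status", "conditions", "[type=Ready]", "status"],
    "revision": ["status", "artifact", "revision"],
    "suspended": ["spec", "suspend"],
    "url": ["spec", "url"],
}

labels = {}
for label, path in labels_from_path.items():
    value = resolve_path(helm_repository, path)
    if value is not None:  # assumption: unresolved paths produce no label
        labels[label] = str(value)
```

Since every label is derived from the same single object, one HelmRepository should produce exactly one series, which is why the repeated lines in the output are unexpected.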

Anything else we need to know?:

Environment:

fischerman commented 3 months ago

I can confirm this. The problem only occurs after some time.

Toasterson commented 2 months ago

I can confirm this bug is still present. Since this application is bundled with kube-prometheus-stack, it would be nice to get an update. There is even a PR that was closed by the bot rather than merged.

dgrisonnet commented 2 months ago

/assign @rexagod
/triage accepted

m3co-code commented 3 weeks ago

Just confirming that this is an issue, and the PR looks like a promising and badly needed fix. The KSM metrics output is invalid after CR updates, which is quite severe for us.

Thanks for already bringing up a PR \o/