kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Duplicate sample for HPA metrics using multiple external metrics with same metric name #2405

Open tl-eirik-albrigtsen opened 4 weeks ago

tl-eirik-albrigtsen commented 4 weeks ago

What happened: Upgraded to Prometheus 2.52, which is now stricter about duplicates: https://github.com/prometheus/prometheus/issues/14089 (similar to https://github.com/kubernetes/kube-state-metrics/issues/2390 )

Have an HPA that looks like this:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: payments-gateway
  namespace: payments
spec:
  maxReplicas: 30
  metrics:
  - external:
      metric:
        name: amqp_messages_unacknowledged
        selector:
          matchLabels:
            queue: queue_one
      target:
        averageValue: "40"
        type: AverageValue
    type: External
  - external:
      metric:
        name: amqp_messages_unacknowledged
        selector:
          matchLabels:
            queue: queue_two
      target:
        averageValue: "40"
        type: AverageValue
    type: External
  - external:
      metric:
        name: amqp_messages_unacknowledged
        selector:
          matchLabels:
            queue: queue_three
      target:
        averageValue: "40"
        type: AverageValue
    type: External
  - external:
      metric:
        name: amqp_messages_unacknowledged
        selector:
          matchLabels:
            queue: queue_four
      target:
        averageValue: "40"
        type: AverageValue
    type: External
  - resource:
      name: cpu
      target:
        averageValue: 300m
        type: AverageValue
    type: Resource
  minReplicas: 3
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: payments-gateway

This causes KSM to attempt to produce duplicate metrics, because it assumes the metric name is unique across target metrics, which is not true here (uniqueness comes from the selectors).
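
To illustrate, the /metrics exposition ends up with one series per external entry but with identical label sets, roughly like this (sketch; the sample values are hypothetical):

kube_horizontalpodautoscaler_spec_target_metric{namespace="payments",horizontalpodautoscaler="payments-gateway",metric_name="amqp_messages_unacknowledged",metric_target_type="average"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="payments",horizontalpodautoscaler="payments-gateway",metric_name="amqp_messages_unacknowledged",metric_target_type="average"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="payments",horizontalpodautoscaler="payments-gateway",metric_name="amqp_messages_unacknowledged",metric_target_type="average"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="payments",horizontalpodautoscaler="payments-gateway",metric_name="amqp_messages_unacknowledged",metric_target_type="average"} 40

The queue value from the selector's matchLabels is not exposed as a label, so nothing disambiguates the four entries.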

debug logs from prometheus:

ts=2024-05-30T12:02:24.763Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_spec_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"average\"}"
ts=2024-05-30T12:02:24.763Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_spec_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"average\"}"
ts=2024-05-30T12:02:24.763Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_spec_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"average\"}"
ts=2024-05-30T12:02:24.764Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_status_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"value\"}"
ts=2024-05-30T12:02:24.764Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_status_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"average\"}"
ts=2024-05-30T12:02:24.764Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_status_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"value\"}"
ts=2024-05-30T12:02:24.764Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_status_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"average\"}"
ts=2024-05-30T12:02:24.764Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_status_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"value\"}"
ts=2024-05-30T12:02:24.764Z caller=scrape.go:1777 level=debug component="scrape manager" scrape_pool=serviceMonitor/monitoring/prometheus-stack-kube-state-metrics/0 target=http://10.42.123.239:8080/metrics msg="Duplicate sample for timestamp" series="kube_horizontalpodautoscaler_status_target_metric{namespace=\"payments\",horizontalpodautoscaler=\"payments-gateway\",metric_name=\"amqp_messages_unacknowledged\",metric_target_type=\"average\"}"

This causes the standard mixin alert PrometheusDuplicateTimestamps to fire continuously.
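
For context, that alert fires on the duplicate-sample counter that Prometheus itself exposes, so it keeps firing as long as every scrape contains duplicates. Roughly (simplified sketch; the exact rule lives in the prometheus mixin):

- alert: PrometheusDuplicateTimestamps
  expr: rate(prometheus_target_scrapes_sample_duplicate_timestamp_total[5m]) > 0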

What you expected to happen:

No duplicate metrics. I'm guessing the temporary workaround is a drop action on kube_horizontalpodautoscaler_status_target_metric|kube_horizontalpodautoscaler_spec_target_metric, but I figured it might be worth raising an issue here for others.
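
For anyone scraping kube-state-metrics with a plain Prometheus config rather than a ServiceMonitor, that drop would look roughly like this (sketch; the job name and target address are hypothetical):

scrape_configs:
  - job_name: kube-state-metrics
    static_configs:
      - targets: ["kube-state-metrics.kube-system.svc:8080"]  # hypothetical address
    metric_relabel_configs:
      # drop the series that collide when an HPA reuses a metric name across selectors
      - source_labels: [__name__]
        regex: kube_horizontalpodautoscaler_status_target_metric|kube_horizontalpodautoscaler_spec_target_metric
        action: drop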

How to reproduce it (as minimally and precisely as possible):

An HPA like the one above, some way to serve external metrics to HPAs (prometheus-adapter or KEDA, I guess), and default kube-state-metrics scraping of HPA metrics.
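
For example, a minimal HPA along these lines should be enough to reproduce the duplicate series (sketch; the names and the external metric are hypothetical, and an external metrics provider has to be serving it):

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: repro-hpa            # hypothetical name
  namespace: default
spec:
  minReplicas: 1
  maxReplicas: 2
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: repro-deployment   # hypothetical workload
  metrics:
  - type: External
    external:
      metric:
        name: some_external_metric   # same metric name twice...
        selector:
          matchLabels:
            shard: a                 # ...distinguished only by the selector
      target:
        type: AverageValue
        averageValue: "10"
  - type: External
    external:
      metric:
        name: some_external_metric
        selector:
          matchLabels:
            shard: b
      target:
        type: AverageValue
        averageValue: "10"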

Anything else we need to know?:

Environment:

tl-eirik-albrigtsen commented 4 weeks ago

Working workaround with metricRelabelings (here using kube-prometheus-stack):


  kube-state-metrics:
    prometheus:
      monitor:
        enabled: true
        metricRelabelings:
        - action: drop
          sourceLabels: [__name__]
          # these metrics generate duplicates
          # https://github.com/kubernetes/kube-state-metrics/issues/2405
          regex: kube_horizontalpodautoscaler_status_target_metric|kube_horizontalpodautoscaler_spec_target_metric

dgrisonnet commented 2 weeks ago

Could be related to https://github.com/kubernetes/kube-state-metrics/issues/2408

/assign /triage accepted