kubernetes / kube-state-metrics

Add-on agent to generate and expose cluster-level metrics.
https://kubernetes.io/docs/concepts/cluster-administration/kube-state-metrics/
Apache License 2.0

Duplicate kube_horizontalpodautoscaler_spec_target_metric #2403

Open dmitriishaburov opened 1 month ago

dmitriishaburov commented 1 month ago

What happened:

Duplicate kube_horizontalpodautoscaler_spec_target_metric series are causing issues with Prometheus 2.52.0+. Due to its new duplicate-detection mechanism, I'm seeing the following errors in Prometheus:

ts=2024-05-28T07:15:26.730Z caller=scrape.go:1738 level=warn component="scrape manager" scrape_pool=serviceMonitor/prometheus/kube-prometheus-stack-kube-state-metrics/0 target=http://172.17.60.213:8080/metrics msg="Error on ingesting samples with different value but same timestamp" num_dropped=12

After checking kube-state-metrics I've found that the metrics for the HPA are duplicated:

cat metrics | grep 'kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"}'
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",horizontalpodautoscaler="dummy",metric_name="cpu",metric_target_type="utilization"} 100
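The duplicate check above can be automated. The following is a minimal sketch (not part of kube-state-metrics; `find_duplicate_series` is a hypothetical helper) that counts series in Prometheus exposition text whose metric name and label set are identical, which is exactly what Prometheus 2.52.0+ rejects on ingestion. It assumes label values contain no spaces and ignores optional timestamps.

```python
from collections import Counter

def find_duplicate_series(exposition_text):
    """Return series (metric name + label set) that appear more than once."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        # Drop the sample value: everything after the last space
        series = line.rsplit(" ", 1)[0]
        counts[series] += 1
    return {s: n for s, n in counts.items() if n > 1}

sample = """\
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",metric_name="cpu"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",metric_name="cpu"} 100
kube_horizontalpodautoscaler_spec_target_metric{namespace="dummy",metric_name="cpu"} 100
"""
print(find_duplicate_series(sample))  # one series, seen 3 times
```

Running this against a `curl http://<pod>:8080/metrics` dump would surface every colliding series, not just the HPA ones.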

The HPA itself has 3 separate entries for CPU, which differ only in the container field:

k get hpa dummy -oyaml
  metrics:
  - containerResource:
      container: dummy-0
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource
  - containerResource:
      container: dummy-1
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource
  - containerResource:
      container: dummy-2
      name: cpu
      target:
        averageUtilization: 100
        type: Utilization
    type: ContainerResource

What you expected to happen: kube-state-metrics should not produce duplicate metrics, probably by adding a container label (?)

How to reproduce it (as minimally and precisely as possible):

Environment:

k8s-ci-robot commented 1 month ago

This issue is currently awaiting triage.

If kube-state-metrics contributors determine this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

Instructions for interacting with me using PR comments are available [here](https://git.k8s.io/community/contributors/guide/pull-requests.md). If you have questions or suggestions related to my behavior, please file an issue against the [kubernetes-sigs/prow](https://github.com/kubernetes-sigs/prow/issues/new?title=Prow%20issue:) repository.
sansalva commented 3 weeks ago

The main issue is that when you use multiple ContainerResource triggers with the same resource name and target type, you don't get a distinct metric for each one: kube-state-metrics generates the same metric with the same label values. e.g.:

metrics:
    - type: ContainerResource
      containerResource:
        name: cpu
        container: main
        target:
          type: Utilization
          averageUtilization: 40
    - type: ContainerResource
      containerResource:
        name: cpu
        container: istio-proxy
        target:
          type: Utilization
          averageUtilization: 90

you get all target_metric series duplicated, making it impossible to identify which one applies to which container. Also, when Prometheus scrapes these, it keeps only one of each, losing information:

kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 90
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="average"} 1.442
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 24
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="average"} 0.836
kube_horizontalpodautoscaler_status_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization"} 69

As @dmitriishaburov suggested, one option would be to add a container label. However, we need to consider what value it would have when using type=Resource. Additionally, with this option, there is a chance that a container name could collide with that specific value. We could also add a type label to solve this, with the values resource and container_resource.
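With that option, the colliding spec series above might be disambiguated along these lines (a hypothetical sketch of the proposal, not existing kube-state-metrics output; the container and type label names are assumptions):

```
kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization",type="container_resource",container="main"} 40
kube_horizontalpodautoscaler_spec_target_metric{namespace="prod",horizontalpodautoscaler="sample",metric_name="cpu",metric_target_type="utilization",type="container_resource",container="istio-proxy"} 90
```

A type=Resource trigger would presumably carry type="resource" and an empty container label, which is what makes the collision question above worth settling.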

Another option would be to have separate metrics for type=Resource and type=ContainerResource. The problem I see with this approach is that it splits information about the same concept across two different metrics, and it also makes building charts more complicated, because you would have to duplicate the queries to cover both types in every scenario.