suxess-it / kubriX

https://kubrix.io

[monitoring] high prometheus metrics cardinality #380

Closed jkleinlercher closed 3 months ago

jkleinlercher commented 3 months ago

One of our Prometheus instances gets an error with remoteWrite to Mimir:

ts=2024-08-01T09:51:39.852Z caller=dedupe.go:112 component=remote level=error remote_name=a65cf4 url=https://metrics-monitoring.lab.suxessit.k8s.cloud.uibk.ac.at/api/v1/push msg="non-recoverable error" count=447 exemplarCount=0 err="server returned HTTP status 400 Bad Request: send data to ingesters: failed pushing to ingester sx-mimir-ingester-zone-a-0: user=anonymous: per-user series limit of 150000 exceeded (err-mimir-max-series-per-user). To adjust the related per-tenant limit, configure -ingester.max-global-series-per-user, or contact your service administrator. (sampled 1/10)"
jkleinlercher commented 3 months ago

We definitely have too many metric series on the Prometheus instance of our sx-cnp-oss cluster.


see http://localhost:9090/api/v1/status/tsdb (seriesCountByMetricName and seriesCountByLabelValuePair):

    "seriesCountByMetricName": [
      {
        "name": "apiserver_request_duration_seconds_bucket",
        "value": 14076
      },
      {
        "name": "etcd_request_duration_seconds_bucket",
        "value": 14064
      },
      {
        "name": "apiserver_request_sli_duration_seconds_bucket",
        "value": 12276
      },
      {
        "name": "apiserver_request_slo_duration_seconds_bucket",
        "value": 12276
      },
      {
        "name": "apiserver_response_sizes_bucket",
        "value": 3120
      },
      {
        "name": "thanos_objstore_bucket_operation_duration_seconds_bucket",
        "value": 2205
      },
      {
        "name": "workqueue_work_duration_seconds_bucket",
        "value": 2002
      },
      {
        "name": "workqueue_queue_duration_seconds_bucket",
        "value": 2002
      },
      {
        "name": "scheduler_plugin_execution_duration_seconds_bucket",
        "value": 1806
      },
      {
        "name": "grpc_server_handled_total",
        "value": 1581
      }
    ],
    "seriesCountByLabelValuePair": [
      {
        "name": "job=kubelet",
        "value": 57365
      },
      {
        "name": "endpoint=https-metrics",
        "value": 57364
      },
      {
        "name": "service=sx-kube-prometheus-stack-kubelet",
        "value": 54620
      },
      {
        "name": "namespace=kube-system",
        "value": 50614
      },
      {
        "name": "metrics_path=/metrics",
        "value": 49018
      },
      {
        "name": "node=k3d-cnp-local-demo-server-0",
        "value": 47514
      },
      {
        "name": "component=apiserver",
        "value": 46640
      },
      {
        "name": "instance=172.25.0.3:10250",
        "value": 45114
      },
      {
        "name": "namespace=default",
        "value": 40495
      },
      {
        "name": "endpoint=https",
        "value": 40373
      }
    ]
jkleinlercher commented 3 months ago

interesting guides:

https://last9.io/blog/how-to-manage-high-cardinality-metrics-in-prometheus/
https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/

kubectl port-forward svc/sx-kube-prometheus-stack-prometheus -n monitoring 9090:9090

then

check the Prometheus TSDB status page: http://localhost:9090/tsdb-status

Queries:

topk(100, count by (__name__, job)({__name__=~".+"}))
topk(100, count by (__name__, instance)({__name__=~".+"}))
jkleinlercher commented 3 months ago

In the querier deployment I set the '-querier.cardinality-analysis-enabled=true' arg:

kubectl edit deployment sx-mimir-querier -n mimir

[...]
    spec:
      containers:
      - args:
        - -target=querier
        - -config.expand-env=true
        - -config.file=/etc/mimir/mimir.yaml
        - -querier.cardinality-analysis-enabled=true

according to https://grafana.com/blog/2022/10/20/how-to-manage-high-cardinality-metrics-in-prometheus-and-kubernetes/, but nothing interesting showed up here.
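To make that flag survive chart upgrades instead of patching the Deployment by hand, something like this in the mimir-distributed values should work (just a sketch: it assumes the chart's mimir.structuredConfig passthrough and that -querier.cardinality-analysis-enabled maps to the limits block of the Mimir config):

# Sketch only, not applied yet: enable the cardinality analysis API via config
# instead of kubectl edit. Assumes the mimir-distributed structuredConfig passthrough.
mimir:
  structuredConfig:
    limits:
      cardinality_analysis_enabled: true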

jkleinlercher commented 3 months ago

maybe these dashboards help: https://github.com/cerndb/grafana-mimir-cardinality-dashboards/tree/main

jkleinlercher commented 3 months ago

next steps:

jkleinlercher commented 3 months ago

Local environment KIND_OBSERVABILITY is set up; next step see https://github.com/suxess-it/sx-cnp-oss/issues/380#issuecomment-2264659275

jkleinlercher commented 3 months ago

The local installation of our observability stack shows the same high series counts:

curl http://localhost:9090/api/v1/status/tsdb |jq

{
  "status": "success",
  "data": {
    "headStats": {
      "numSeries": 133146,
      "numLabelPairs": 8184,
      "chunkCount": 266703,
      "minTime": 1722580538141,
      "maxTime": 1722586972676
    },
    "seriesCountByMetricName": [
      {
        "name": "etcd_request_duration_seconds_bucket",
        "value": 14064
      },
      {
        "name": "apiserver_request_duration_seconds_bucket",
        "value": 13968
      },
      {
        "name": "apiserver_request_sli_duration_seconds_bucket",
        "value": 12144
      },
      {
        "name": "apiserver_request_slo_duration_seconds_bucket",
        "value": 12144
      },
      {
        "name": "apiserver_response_sizes_bucket",
        "value": 3104
      },

Also, the Grafana Mimir dashboards show the same high counts.


Next step: check whether removing some scrape config according to https://github.com/suxess-it/sx-cnp-oss/issues/353#issuecomment-2263946662 helps.
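For reference, "removing some scrape config" in kube-prometheus-stack could look roughly like this (a sketch assuming the chart's standard values layout; which components to disable is just an example):

# Sketch: disable scrape jobs for control-plane components we don't need
# (or that aren't reachable in a k3d cluster anyway).
kubeEtcd:
  enabled: false
kubeScheduler:
  enabled: false
kubeControllerManager:
  enabled: false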

jkleinlercher commented 3 months ago

Also good documents:
https://medium.com/@dotdc/prometheus-performance-and-cardinality-in-practice-74d5d9cd6230
https://medium.com/@dotdc/how-to-find-unused-prometheus-metrics-using-mimirtool-a44560173543

jkleinlercher commented 3 months ago

So our next steps will be:

Maybe https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/configuration/config-other-methods/helm-operator-migration/reduce_usage/ also helps

The picture in https://victoriametrics.com/blog/cardinality-explorer/ is quite nice.

jkleinlercher commented 3 months ago

Unused metrics with high series count (> 1000 series count):

apiserver_request_duration_seconds_bucket
etcd_request_duration_seconds_bucket
apiserver_request_slo_duration_seconds_bucket
apiserver_response_sizes_bucket
workqueue_work_duration_seconds_bucket
scheduler_plugin_execution_duration_seconds_bucket
apiserver_watch_events_sizes_bucket

jkleinlercher commented 3 months ago

One important thing I learned in my local k3d environment: you need to find out which Prometheus job scrapes these metrics; then you know in which part of the kube-prometheus-stack values you need to set the metricRelabelings.


I guessed (and maybe that is also true for non-k3d clusters) that the metric 'etcd_request_duration_seconds_bucket' should be dropped in the kubeEtcd section. However, for whatever reason the metrics get scraped by the apiserver and kubelet jobs, so I needed to drop them in the "kubelet" and "kubeApiServer" sections.
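A minimal sketch of what that looks like in the kube-prometheus-stack values (the regex only covers two of the metrics from the list above and would need to be extended):

# Sketch: drop the high-cardinality histogram buckets at scrape time, in the
# jobs that actually expose them (kubelet and kubeApiServer, not kubeEtcd).
kubelet:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: "etcd_request_duration_seconds_bucket|apiserver_request_duration_seconds_bucket"
        action: drop
kubeApiServer:
  serviceMonitor:
    metricRelabelings:
      - sourceLabels: [__name__]
        regex: "etcd_request_duration_seconds_bucket|apiserver_request_duration_seconds_bucket"
        action: drop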

jkleinlercher commented 3 months ago

While working on this I realized that it takes a lot of time to drop metrics, and on uibklab some of the metrics which mimirtool reported as unused are now used. Also, some of the metrics come not just from kube-prometheus-stack but from other applications like kyverno or argocd, ... So now I wonder whether the approach in https://grafana.com/docs/grafana-cloud/monitor-infrastructure/kubernetes-monitoring/configuration/config-other-methods/helm-operator-migration/reduce_usage/ would be better for us: it defines an allow_list before writing to mimir.
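The allow_list idea from that doc would roughly translate to a "keep" writeRelabelConfig on the remoteWrite in our kube-prometheus-stack values, something like this sketch (the regex is a made-up placeholder; the real list would come from the mimirtool "used metrics" output):

# Sketch: only forward an allow-list of series to Mimir instead of dropping
# unused metrics one by one. The regex below is a hypothetical placeholder.
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://metrics-monitoring.lab.suxessit.k8s.cloud.uibk.ac.at/api/v1/push
        writeRelabelConfigs:
          - sourceLabels: [__name__]
            regex: "up|kube_.*|node_.*|container_cpu_.*|container_memory_.*"
            action: keep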

However, for now I think the easiest and fastest solution is to increase max_global_series_per_user like others did, e.g. in https://github.com/grafana/helm-charts/issues/1320, and to improve on the metrics and capacity planning afterwards.
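Raising the limit would roughly look like this in the mimir-distributed values (a sketch assuming the chart's structuredConfig passthrough; 300000 is just an example value):

# Sketch: raise the per-tenant series limit behind the
# err-mimir-max-series-per-user 400 above (Mimir default is 150000).
mimir:
  structuredConfig:
    limits:
      max_global_series_per_user: 300000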

jkleinlercher commented 3 months ago

With https://github.com/suxess-it/sx-cnp-oss/issues/390 we changed from kube-prometheus-stack to the k8s-monitoring Helm chart and are now at about 60k metric series, so this issue is solved for now.