grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Mimir dashboards issue after upgrade from 2.8 to 2.9 #5583

Open mkovsher opened 1 year ago

mkovsher commented 1 year ago

After upgrading from version 2.8 (helm 4.4.1) to 2.9 (helm 5.0.0), some dashboards don't work.

Prometheus converts the metric "cortex_request_duration_seconds_count" to "cortex_request_duration_seconds{count:n sum:...}".

To Reproduce

  1. Upgrade Mimir 2.8 (helm 4.4.1) to 2.9 (helm 5.0.0).
  2. Go to the dashboard Mimir / Writes.

Metrics from ingester pod: cortex_request_duration_seconds_count{method="GET",route="ready",status_code="200",ws="false"} 1078

Metric from Prometheus:

cortex_request_duration_seconds{cluster="grafana-mimir", container="ingester", endpoint="http-metrics", instance="10.0.12.187:8080", job="mimir/ingester", method="GET", namespace="mimir", pod="grafana-mimir-ingester-zone-a-0", route="ready", service="grafana-mimir-ingester-zone-a", status_code="200", ws="false"}
{ count:5 sum:0.000087135 (0.000010789593218788871,0.000011766134837401892]:2 (0.000011766134837401892,0.000012831061023768835]:1 (0.000016639827463764308,0.0000181458605194507]:1 (0.000033279654927528616,0.0000362917210389014]:1 }

Expected behavior

Should be in Prometheus: cortex_request_duration_seconds_count{cluster="grafana-mimir", container="ingester", endpoint="http-metrics", instance="10.0.12.179:8080", job="mimir/ingester", method="GET", namespace="mimir", pod="grafana-mimir-ingester-zone-a-0", route="metrics", service="grafana-mimir-ingester-zone-a", status_code="200", ws="false"}

Environment

Kubernetes: 1.26
Mimir: 2.9 (helm chart 5.0.0)
Prometheus: 2.45.0
Grafana: v10.0.1

pstibrany commented 1 year ago

Hi.

Is your Prometheus configured to scrape "native histograms" (--enable-feature=native-histograms)? In that case Prometheus will ignore the "classic" histogram with the same name (cortex_request_duration_seconds in this case). In Prometheus version 2.45.0 and later you can enable scraping of "classic" histograms too by setting the scrape_classic_histograms option in the scrape config section of your Prometheus config file.
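
As a rough illustration of the option mentioned above, a scrape job might look like the sketch below. The job name and target are hypothetical placeholders, not taken from the reporter's setup, and Prometheus is assumed to be version 2.45.0 or later, started with --enable-feature=native-histograms.

```yaml
# Sketch of a prometheus.yml fragment; job name and target are hypothetical.
# Prometheus is assumed to run with --enable-feature=native-histograms.
scrape_configs:
  - job_name: mimir-ingester                      # hypothetical job name
    # Keep ingesting the classic _bucket/_count/_sum series in addition to
    # the native histogram of the same name (Prometheus 2.45.0 and later).
    scrape_classic_histograms: true
    static_configs:
      - targets: ["mimir-ingester.example:8080"]  # hypothetical target
```

The option is set per scrape job, so it would need to be repeated for every job that scrapes Mimir components.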

mkovsher commented 1 year ago

Hi. Thanks for the quick response.

Yes, my Prometheus has the native-histograms feature enabled.

I've run several tests based on your recommendation:

  1. MimirPlay: it works. I received both types of histograms (classic and new). I tested on v2.8-2.9. I ran Prometheus with --enable-feature=native-histograms and added scrape_classic_histograms: true to the job in the Prometheus config.
  2. However, with my own Prometheus configured manually, it did not work. I added scrape_classic_histograms: true to each Mimir job in the scrape_configs section, but I only got the new metric, not the classic one as expected.
  3. Helm: I added the option scrape_classic_histograms: true to the additionalScrapeConfigs section for each job added by Prometheus-operator and got an error:

     failed to reload config: couldn't load configuration (--config.file="/etc/prometheus/config_out/prometheus.env.yaml"): parsing YAML file /etc/prometheus/config_out/prometheus.env.yaml: found multiple scrape configs with job name "serviceMonitor/mimir/grafana-mimir-alertmanager/0"

     I've looked at the generated config file, and it has 2 entries for each Mimir job:

       1. One generated by Prometheus-operator from the service monitor.
       2. One generated from the additionalScrapeConfigs section.

Perhaps you know the answers:

  1. Why didn't it work when I added this option manually to jobs?
  2. How can I add this parameter to helm if I use ServiceMonitor?
  3. Maybe there is a way to specify this parameter globally in Prometheus?
  4. Why was this metric not converted in version 2.8, but is converted in version 2.9?

pstibrany commented 1 year ago

why didn't it work when I added this option manually to jobs?

This option is only supported since Prometheus 2.45.0. Do you use this version everywhere?

How can I add this parameter to helm if I use ServiceMonitor?

I'm not very familiar with Helm, and I don't know the answer to this.

Maybe there is a way to specify this parameter globally in Prometheus?

I haven't seen such an option. I would suggest opening an issue about it in Prometheus. Another option is to disable native histograms in Prometheus. It's a new experimental feature, and if you don't use it yet, it may be better to disable it for now.
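
For a Prometheus managed by Prometheus-operator, disabling the feature could look roughly like the sketch below. It assumes the feature was enabled through the enableFeatures field of the Prometheus custom resource, which the thread does not confirm. The resource name and namespace are hypothetical.

```yaml
# Sketch only: a Prometheus custom resource for prometheus-operator.
# The name and namespace are hypothetical placeholders.
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: example
  namespace: monitoring
spec:
  # With "native-histograms" left out of this list, the operator starts
  # Prometheus without --enable-feature=native-histograms, so only the
  # classic histogram series are scraped.
  enableFeatures: []
```

If the flag was instead passed directly on the Prometheus command line, removing --enable-feature=native-histograms from the arguments has the same effect.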

Why was this metric not converted in version 2.8, but is converted in version 2.9?

Mimir 2.9 started exporting cortex_request_duration_seconds as a native histogram too. However, it's up to the client, such as Prometheus, to decide whether it will scrape native histograms or not.

mkovsher commented 1 year ago

This option is only supported since Prometheus 2.45.0. Do you use this version everywhere?

Yes, we use version 2.45.0 everywhere.

I haven't seen such an option. I would suggest opening an issue about it in Prometheus. Another option is to disable native histograms in Prometheus. It's a new experimental feature, and if you don't use it yet, it may be better to disable it for now.

I will consider this option (disable feature) :-).

Mimir 2.9 started exporting cortex_request_duration_seconds as a native histogram too. However, it's up to the client, such as Prometheus, to decide whether it will scrape native histograms or not.

OK. All that remains is to configure Prometheus.

Thanks.

mkovsher commented 1 year ago

If Mimir sends the native histograms, will the dashboards be modified with this in mind?

pstibrany commented 1 year ago

If Mimir sends the native histograms, will the dashboards be modified with this in mind?

Mimir currently exposes only a single histogram as a "native histogram": cortex_request_duration_seconds. In my opinion, the feature needs to be widely deployed and no longer marked as experimental in Prometheus and Mimir before we start using native histograms more widely.