grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

rate/irate on large counters sometimes creates ghost counter resets #5198

Open nlowe-sx opened 1 year ago

nlowe-sx commented 1 year ago

Describe the bug

While attempting to migrate from Cortex 1.10.0 to Mimir 2.8.0 we observed some strange behavior with rate() and irate() queries. Sometimes, for very large counters, rate()/irate() seem to insert ghost/synthetic counter resets, jumping the reported rate by tens of orders of magnitude for a few data points. Comparing the query against the Prometheus instance that remote-wrote the data, we do not see this behavior, so we do not believe it is a problem on the ingest side. Additionally, Cortex 1.10.0, which our clusters are still configured to remote_write to, does not exhibit this problem.

Initially, we saw this behavior in the p99 of the cortex_request_duration_seconds_bucket histogram for distributors and ingesters, where the p99 would jump to several minutes from a nominal level of 25ms, and we thought it might be a data reporting problem with Mimir itself. However, we recently discovered an extended period of time (several days) where this happened for the node_network_transmit_bytes_total metric for a few interfaces on one of our nodes.

In Grafana, the problem looks like this:

If we query the irate(), we can see a very large spike from a value that is nominally single-digit Gbit/sec to 100s of millions of Gbit/sec:

(screenshot: irate() query)

Looking at the raw counter, there are no counter resets and this jump does not match the slope:

(screenshot: raw counter)

If we take the slope directly with deriv instead it looks normal:

(screenshot: deriv() query)

Here's another example we can compare with data in Prometheus. Again, the rate of the request counter spikes by many orders of magnitude:

(screenshot: Grafana)

Prometheus disagrees:

(screenshot: Prometheus)

If we ask mimir about counter resets, it thinks it sees one:

(screenshot: resets() in Mimir)

Again, prometheus disagrees:

(screenshot: resets() in Prometheus)

The only commonality I can find between these two series is that they are large (values in excess of 1 billion).

We have observed this problem in both compacted data in historical queries and recent data served from ingesters.

We have tried disabling query sharding (even though these queries should not be shard-able) and did not notice any change.

To Reproduce

Steps to reproduce the behavior:

  1. Start Mimir 2.8.0 in microservices mode
  2. Scrape something that produces a rapidly increasing counter and remote write it into Mimir (see the sketch after this list)
  3. Query the rate of that counter
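
A rough sketch of step 2, assuming Go and github.com/prometheus/client_golang; the metric name is hypothetical, and the Prometheus (or Grafana Agent) scrape config plus remote_write to Mimir are assumed and not shown here:

```go
// counter_exporter.go — exposes a hypothetical, rapidly increasing counter
// on :8080/metrics for a scraper that remote-writes into Mimir.
package main

import (
	"log"
	"net/http"
	"time"

	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// Hypothetical metric name; any fast-moving counter with large
	// absolute values should do.
	c := prometheus.NewCounter(prometheus.CounterOpts{
		Name: "repro_fast_bytes_total",
		Help: "Synthetic counter that quickly grows to very large values.",
	})
	prometheus.MustRegister(c)

	// Add a large increment every 100ms so the value passes 1 billion
	// within seconds, matching the scale of the series reported above.
	go func() {
		for range time.Tick(100 * time.Millisecond) {
			c.Add(50_000_000)
		}
	}()

	http.Handle("/metrics", promhttp.Handler())
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```

Step 3 would then be a query such as `rate(repro_fast_bytes_total[5m])` against the Mimir query frontend.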

Expected behavior

The rate roughly matches the slope of the counter

Environment

Additional Context

We're trying to reproduce this synthetically to make it easier to troubleshoot. Please let me know if there's any additional data that would be useful to provide and I can try to work through getting that cleared to share.

dimitarvdimitrov commented 1 year ago

Thank you for reporting this.

Can you send a screenshot of the raw series' samples from this query? Normally in a range query promQL groups data points into steps and you might lose some resolution. You can get the raw samples by running this instant query in grafana

`node_network_transmit_bytes_total{device="..."}[$__range]`

nlowe-sx commented 1 year ago

Huh. That looks weird to me. There seem to be two samples for each timestamp, and if Mimir happens to treat them as sequential samples and the second value for a timestamp happens to be lower, I could see how that could cause it to think the counter was reset. I've omitted most of the labels here to make the table easier to read, but this is indeed all one series.

`node_network_transmit_bytes_total{device="...",node="..."}[4m]`

| `Time` | `__name__` | `Value` |
|-----------------------|-------------------------------------|-------------------|
| `2023-05-22 19:45:26` | `node_network_transmit_bytes_total` | `745105499837576` |
| `2023-05-22 19:45:36` | `node_network_transmit_bytes_total` | `745106005752138` |
| `2023-05-22 19:45:26` | `node_network_transmit_bytes_total` | `745105494849075` |
| `2023-05-22 19:45:36` | `node_network_transmit_bytes_total` | `745106008475545` |
| `2023-05-22 19:45:46` | `node_network_transmit_bytes_total` | `745106524145616` |
| `2023-05-22 19:45:46` | `node_network_transmit_bytes_total` | `745106526114699` |
| `2023-05-22 19:45:56` | `node_network_transmit_bytes_total` | `745107515449287` |
| `2023-05-22 19:45:56` | `node_network_transmit_bytes_total` | `745107492627113` |
| `2023-05-22 19:46:06` | `node_network_transmit_bytes_total` | `745109010062291` |
| `2023-05-22 19:46:06` | `node_network_transmit_bytes_total` | `745108997506179` |
| `2023-05-22 19:46:16` | `node_network_transmit_bytes_total` | `745109596982426` |
| `2023-05-22 19:46:16` | `node_network_transmit_bytes_total` | `745109599677196` |
| `2023-05-22 19:46:26` | `node_network_transmit_bytes_total` | `745110285527594` |
| `2023-05-22 19:46:26` | `node_network_transmit_bytes_total` | `745110295049076` |
| `2023-05-22 19:46:36` | `node_network_transmit_bytes_total` | `745111362870260` |
| `2023-05-22 19:46:36` | `node_network_transmit_bytes_total` | `745111366354921` |
| `2023-05-22 19:46:46` | `node_network_transmit_bytes_total` | `745112670942764` |
| `2023-05-22 19:46:46` | `node_network_transmit_bytes_total` | `745112673546857` |
| `2023-05-22 19:46:56` | `node_network_transmit_bytes_total` | `745113981581520` |
| `2023-05-22 19:46:56` | `node_network_transmit_bytes_total` | `745114005947801` |
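
For context, PromQL's counter-reset detection in `rate()`/`irate()`/`resets()` treats any decrease between consecutive samples as a reset. A minimal Go sketch (illustrative only, not Mimir or Prometheus source) run over four of the samples above shows how the slightly lower near-duplicate at 19:45:56 would register as a reset:

```go
// Minimal sketch of PromQL's counter-reset heuristic: any decrease between
// consecutive samples in the selected range is treated as a counter reset.
package main

import "fmt"

func countResets(values []float64) int {
	resets := 0
	for i := 1; i < len(values); i++ {
		if values[i] < values[i-1] {
			resets++
		}
	}
	return resets
}

func main() {
	// The 19:45:46 and 19:45:56 sample pairs from the table above.
	samples := []float64{
		745106524145616, // 19:45:46
		745106526114699, // 19:45:46 (duplicate timestamp)
		745107515449287, // 19:45:56
		745107492627113, // 19:45:56 (duplicate timestamp, slightly lower -> "reset")
	}
	fmt.Println("apparent resets:", countResets(samples)) // prints 1
}
```

So if the merged stream really does contain both values per timestamp in this order, Mimir reporting a reset where Prometheus sees none would follow directly.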

Here's another interesting one that happened earlier today, so I can compare against Prometheus. This one seems to have a single data point with two values for the same timestamp:

Mimir

`cortex_request_duration_seconds_count{cluster="...",container="ingester",route="/cortex.Ingester/Push",status_code="success",pod="ingester-zone-f-14"}[4m]`

| `Time` | `__name__` | `Value` |
|-----------------------|-----------------------------------------|--------------|
| `2023-06-08 09:11:28` | `cortex_request_duration_seconds_count` | `2416452550` |
| `2023-06-08 09:11:38` | `cortex_request_duration_seconds_count` | `2416488545` |
| `2023-06-08 09:11:48` | `cortex_request_duration_seconds_count` | `2416524799` |
| `2023-06-08 09:11:58` | `cortex_request_duration_seconds_count` | `2416562203` |
| `2023-06-08 09:12:08` | `cortex_request_duration_seconds_count` | `2416598863` |
| `2023-06-08 09:12:18` | `cortex_request_duration_seconds_count` | `2416636616` |
| `2023-06-08 09:12:28` | `cortex_request_duration_seconds_count` | `2416673083` |
| `2023-06-08 09:12:38` | `cortex_request_duration_seconds_count` | `2416709097` |
| `2023-06-08 09:12:48` | `cortex_request_duration_seconds_count` | `2416746326` |
| `2023-06-08 09:13:08` | `cortex_request_duration_seconds_count` | `2416820951` |
| `2023-06-08 09:13:18` | `cortex_request_duration_seconds_count` | `2416858209` |
| **Duplicate** `2023-06-08 09:13:28` | `cortex_request_duration_seconds_count` | `2416894333` |
| **Duplicate** `2023-06-08 09:13:28` | `cortex_request_duration_seconds_count` | `2416894309` |
| `2023-06-08 09:13:38` | `cortex_request_duration_seconds_count` | `2416932069` |
| `2023-06-08 09:13:48` | `cortex_request_duration_seconds_count` | `2416970282` |
| `2023-06-08 09:13:58` | `cortex_request_duration_seconds_count` | `2417006752` |
| `2023-06-08 09:14:08` | `cortex_request_duration_seconds_count` | `2417043879` |
| `2023-06-08 09:14:18` | `cortex_request_duration_seconds_count` | `2417080808` |
| `2023-06-08 09:14:28` | `cortex_request_duration_seconds_count` | `2417117187` |
| `2023-06-08 09:14:38` | `cortex_request_duration_seconds_count` | `2417154370` |

And here's what prometheus sees:

Prometheus

`cortex_request_duration_seconds_count{container="ingester",route="/cortex.Ingester/Push",status_code="success",pod="ingester-zone-f-14"}[4m]`

Replica 0

```
cortex_request_duration_seconds_count{...}
2416304474 @1686215448.612
2416341960 @1686215458.612
2416378396 @1686215468.612
2416415637 @1686215478.612
2416452049 @1686215488.612
2416487733 @1686215498.612
2416524106 @1686215508.612
2416561753 @1686215518.612
2416598394 @1686215528.612
2416636181 @1686215538.612
2416672347 @1686215548.612
2416708364 @1686215558.612
2416745929 @1686215568.612
2416783813 @1686215578.612
2416820514 @1686215588.612
2416857630 @1686215598.612
2416894333 @1686215608.612***
2416931466 @1686215618.612
2416969735 @1686215628.612
2417006362 @1686215638.612
2417043371 @1686215648.612
2417080122 @1686215658.612
2417116864 @1686215668.612
2417154099 @1686215678.612
```

Replica 1

```
cortex_request_duration_seconds_count{...}
2416305369 @1686215448.753
2416342508 @1686215458.753
2416379755 @1686215468.753
2416415993 @1686215478.753
2416452550 @1686215488.753
2416488545 @1686215498.753
2416524799 @1686215508.753
2416562203 @1686215518.753
2416598863 @1686215528.753
2416636616 @1686215538.753
2416673083 @1686215548.753
2416709097 @1686215558.753
2416746326 @1686215568.753
2416784000 @1686215578.753
2416820951 @1686215588.753
2416858209 @1686215598.753
2416894309 @1686215608.753***
2416932069 @1686215618.753
2416970282 @1686215628.753
2417006752 @1686215638.753
2417043879 @1686215648.753
2417080808 @1686215658.753
2417117187 @1686215668.753
2417154370 @1686215678.753
```

Does this indicate a problem with the HA Tracker?

nlowe-sx commented 1 year ago

We tried turning off OOO sample ingestion and are still seeing these weird samples show up every so often. At this point we're unable to trust mimir to handle our data and cannot complete our migration. Is there any additional data I can provide that would make it easier to troubleshoot this?

nlowe-sx commented 1 year ago

We recently changed our HA Tracker store from etcd to consul and haven't experienced this bug in 9+ days. We will keep monitoring, but for now this seems to be related to etcd, perhaps? Looking at the cortex_request_duration_seconds_count series that seems to reproduce this bug the most, we haven't seen any spikes since cutting over to consul (the blue annotation in the screenshot below). Our nominal rate for this series is around 120k / sec.

(screenshot)

jmichalek132 commented 1 year ago

Hi, we are affected by this too. We are using the opentelemetry-collector to send metrics into Mimir and see the exact same issue happening, particularly on a high-value metric coming from otel, k8s_pod_network_io. We are also running Mimir 2.8.0. However, we run otel as a daemonset, so we don't use the HA tracker feature for deduplication because we send each metric only once. I can provide the same level of detail if requested. I will also upgrade to 2.9.0 soon to see if that helps.

jmichalek132 commented 1 year ago

I finally found time to upgrade Mimir to 2.9.0 and can confirm it's still happening.

We can see a reset is detected. (screenshot)

The raw query: (screenshot)

Data from the query:

    "data": {
      "values": [
        [
          1691403780000,
          1691403840000,
          1691403900000,
          1691403960000,
          1691404020000,
          1691404080000,
          1691404140000,
          1691404200000,
          1691404260000,
          1691404320000,
          1691404380000,
          1691404440000,
          1691404500000,
          1691404560000
        ],
        [
          616327985923,
          616410676205,
          616479558915,
          616562515237,
          616644059848,
          616697027902,
          616768046955,
          616850655521,
          616913982572,
          616995731837,
          617071140070,
          617126787917,
          617222044703,
          617303448582
        ]
      ]
    }
a517363523 commented 1 year ago

I also encountered this problem. (screenshots)

bboreham commented 9 months ago

> Does this indicate a problem with the HA Tracker?

HA-Tracker picks one of two sources of data for the same series, intended for situations where one of the sources stops sending.

If, at the point of fail-over, the other source is sending a lower value for the same counter, this will look like a counter reset and produce a big jump in the computed rate, i.e. it matches the reported symptom.

The expectation is that both "HA" sources are scraping the same data source, so they should send the same values. I can't immediately think of a straightforward way in which this can go wrong. Also we have not had many reports of this symptom.
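
To make the size of the jump concrete: PromQL's reset correction adds the pre-decrease sample's full value back into the computed increase, on the assumption that the counter restarted from zero. A rough Go sketch (illustrative only, not Mimir source) over the samples around 09:13:28 above, assuming a hypothetical fail-over from replica 0 to replica 1 at that instant:

```go
// Rough sketch of the reset correction behind increase()/rate(): a decrease
// adds the previous sample's full value to the computed increase.
package main

import "fmt"

func increaseWithResetCorrection(values []float64) float64 {
	inc := values[len(values)-1] - values[0]
	for i := 1; i < len(values); i++ {
		if values[i] < values[i-1] {
			inc += values[i-1] // counter assumed to have reset to 0 here
		}
	}
	return inc
}

func main() {
	// Samples from the tables above: replica 0 is selected up to 2416894333,
	// then a hypothetical fail-over switches to replica 1, whose sample at
	// the same instant is 24 lower.
	withFailover := []float64{2416857630, 2416894333, 2416894309, 2416932069}
	fmt.Printf("computed increase: %.0f\n", increaseWithResetCorrection(withFailover))
	// Prints ~2.4e9 instead of the true ~74k over this window.
}
```

Under that assumption, a 24-count dip turns into roughly 2.4 billion of spurious increase over a 4-minute window, which would match the orders-of-magnitude spikes reported.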

> We recently changed our HA Tracker store from etcd to consul and haven't experienced this bug in 9+ days.

Again, I do not know any reason why that would be the case.

At this time I can only suggest inspecting the various components for overload. Anything else you can add may be useful.

ChrisSteinbach commented 9 months ago

We are seeing exactly the same thing: a false reset that gives spurious results for rate queries. Oddly, this somehow fixes itself if we add resolution to the query, so that instead of:

`rate(container_cpu_usage_seconds_total{container!=""}[10m])` (screenshot: before)

we give something like,

`rate(container_cpu_usage_seconds_total{container!=""}[10m:3m])` (screenshot: after)

Here's the reset, i.e. `resets(container_cpu_usage_seconds_total{}[10m])`: (screenshot)

And the actual counter showing no resets for the same period: (screenshot)

We are running Mimir 2.9.0

dimitarvdimitrov commented 9 months ago

> And the actual counter showing no resets for the same period:

How did you run this query? You can try running this as an instant query and then inspecting the samples around 15:45

`container_cpu_usage_seconds_total{}[$__range]`

ChrisSteinbach commented 9 months ago

@dimitarvdimitrov We ran the query like this, except with one pod selected:

`container_cpu_usage_seconds_total{}`

If I run the query you suggest, it looks like this: (screenshot)

So there does indeed appear to be a small "reset", but the same artefact is not present in the original data scraped from a Prometheus instance when I zoom in to the same spot using the Prometheus GUI.

dimitarvdimitrov commented 9 months ago

@ChrisSteinbach does this mean that there is only one Prometheus remote-writing to Mimir, i.e. you're not using the distributor HA tracker? (If the answer is "no", can you check the exact sample values and timestamps on both HA Prometheus instances at ~15:45:30?)

ChrisSteinbach commented 9 months ago

@dimitarvdimitrov we're scraping from Prometheus using a Grafana Agent, but you may be on to something. I'm not the administrator of the Prometheus we scrape, but I can see that there are two instances at work, which suggests an HA setup. Unfortunately I can't address them individually. It's something for us to look into anyhow.