Open richardmoe opened 2 weeks ago
Hello, does this always happen to you on the last sample?
Note that Mimir doesn't offer isolation, because of its distributed nature. When series for different buckets of a histogram are written, there's a moment when some of them have been written but others haven't yet. If the query is executed at that specific moment, the histogram_quantile function may see the new sample only for the higher buckets and not for the lower ones, which inflates the p99 value.
There's no easy fix for this on classic histograms, and we are not planning to fix it, because this issue doesn't exist in native histograms, which are becoming stable now with the release of Prometheus 3.0.
Please reopen the issue if you see this happening consistently in samples that were already written "a while ago".
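To illustrate the partial-write effect described above, here is a minimal Python sketch. It is a simplified approximation of how a classic-histogram quantile is estimated by linear interpolation inside a bucket, not the actual Prometheus or Mimir code. The function name `bucket_quantile`, the bucket bounds, and all counts are made up for the example.

```python
import math


def bucket_quantile(q, buckets):
    """Estimate quantile q (0 < q < 1) from cumulative classic-histogram buckets.

    buckets: list of (upper_bound, cumulative_count) pairs, including +Inf.
    Simplified sketch; hypothetical helper, not a real library function.
    """
    buckets = sorted(buckets)
    if len(buckets) < 2 or not math.isinf(buckets[-1][0]):
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the rank.
    idx = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    if idx == len(buckets) - 1:
        # Rank falls in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    start, prev_count = (0.0, 0.0) if idx == 0 else buckets[idx - 1]
    end, count = buckets[idx]
    # Linear interpolation inside the selected bucket.
    return start + (end - start) * (rank - prev_count) / (count - prev_count)


# All bucket series carry the same (consistent) sample. Counts are invented.
consistent = [(0.1, 900), (0.5, 990), (1.0, 1000), (math.inf, 1000)]

# Partial write: only the two highest buckets already include 100 new
# observations, while the lower buckets still hold the older sample.
partial = [(0.1, 900), (0.5, 990), (1.0, 1100), (math.inf, 1100)]

p99_consistent = bucket_quantile(0.99, consistent)  # about 0.5
p99_partial = bucket_quantile(0.99, partial)        # about 0.95
print(p99_consistent, p99_partial)
```

In this made-up case the new observations are attributed entirely to the le=1.0 bucket, so the estimated p99 jumps from about 0.5 to about 0.95 even though nothing about the real latency distribution needs to have changed.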
Hi again, we can also see the issue in metrics written a while ago. Here is an example graph of metrics written almost 2 weeks ago:
In this case, I would recommend digging down to a single histogram series and checking what's going on with the buckets.
I would pick one of the pods that differs and run an instant query for it in Grafana: rec_api_request_latency_bucket{...}[$__range]. That will show you the raw data stored, and you can check what's going on.
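Once the raw bucket samples are exported, one thing to look for is timestamps where some buckets are missing or where cumulative counts decrease as `le` grows. The sketch below assumes the samples have already been loaded into `(timestamp, le, cumulative_count)` tuples; that layout, the function name `find_inconsistent_timestamps`, and the sample values are all hypothetical and not something Grafana or Mimir produces directly.

```python
import math
from collections import defaultdict


def find_inconsistent_timestamps(samples, expected_les):
    """Return timestamps where the bucket set looks incomplete or non-monotonic.

    samples: iterable of (timestamp, le, cumulative_count) tuples (assumed layout).
    expected_les: the bucket upper bounds the histogram should expose.
    """
    by_ts = defaultdict(dict)
    for ts, le, count in samples:
        by_ts[ts][le] = count

    problems = {}
    for ts, counts in sorted(by_ts.items()):
        missing = [le for le in expected_les if le not in counts]
        ordered = sorted(counts.items())
        # Cumulative counts must not decrease as the upper bound grows.
        non_monotonic = any(a[1] > b[1] for a, b in zip(ordered, ordered[1:]))
        if missing or non_monotonic:
            problems[ts] = {"missing": missing, "non_monotonic": non_monotonic}
    return problems


# Invented data: at timestamp 2 the le=0.1 bucket has no sample yet.
les = [0.1, 0.5, math.inf]
raw = [
    (1, 0.1, 90), (1, 0.5, 99), (1, math.inf, 100),
    (2, 0.5, 110), (2, math.inf, 120),
]
problems = find_inconsistent_timestamps(raw, les)
print(problems)
```

Comparing the flagged timestamps between the Prometheus and Mimir data sources may help narrow down where the two diverge.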
The data from an instant query looks pretty similar and I haven't been able to see any big difference there.
You need to switch Format to Time series to render them as graphs, and I'd recommend rendering both data sources on the same graph if you want to compare (use the mixed data source, then choose a data source per query).
To get a time series graph, the query type needs to be Range or Both.
You're still rendering histogram_quantile, which is why you can't render time series from an instant query. Please see my suggestion above:
I would check one of the pods that differs, and query an instant query of that in grafana as:
rec_api_request_latency_bucket{...}[$__range]
Something like this:
Describe the bug
We have observed unexpected behavior when using the rate() function on histogram metrics in Mimir compared to Prometheus. Specifically, we sporadically see a significant spike in the Mimir output that is not present in Prometheus.
To Reproduce
Expected behavior
Expected to see the same results in Prometheus and Mimir.
Environment
Additional Context