grafana / mimir

Grafana Mimir provides horizontally scalable, highly available, multi-tenant, long-term storage for Prometheus.
https://grafana.com/oss/mimir/
GNU Affero General Public License v3.0

Support analyzing dpm in the cardinality API #7281

Open Logiraptor opened 9 months ago

Logiraptor commented 9 months ago

Is your feature request related to a problem? Please describe.

There are two dimensions to ingestion in Mimir: space (i.e., unique series) and time (samples per minute, or dpm). The current cardinality API is useful only for understanding the space dimension and has no support for the time dimension.

Describe the solution you'd like

We should find some efficient way to count the dpm of active time series. I think it may be possible to extend the active series tracker with a dpm measurement based on two rotating buckets.

Essentially, we track two numbers for each series (openDpmBucket and closedDpmBucket). Each time a series is updated in the tracker, we increment the openDpmBucket. Each time the tracker is purged (ingester.active-series-metrics-update-period, default = 1m), we swap the values of openDpmBucket and closedDpmBucket, then reset openDpmBucket to 0.

Then we could estimate the dpm of any series as closedDpmBucket / UpdatePeriod. This works as long as the series receives at least one sample per UpdatePeriod. If the actual rate is lower than one sample per UpdatePeriod, we may misreport the dpm as 0.

Alternatively, we could do the same thing, but use the IdleTimeout as the bucket window, which would give a more useful lower bound of 0.1 dpm by default, or 0.05 dpm in Grafana Cloud.

Describe alternatives you've considered

We've seen that Grafana Cloud customers resort to expensive count_over_time queries to find the source of high dpm. One popular solution is to run the query sum by (job) (scrape_samples_scraped). This works great when the data comes from a Prometheus instance, but in practice there are many ways time series data can find its way into Mimir, so there's still a gap for some users.

replay commented 6 months ago

This is going to sound a bit crazy, but I think I would prefer an alternative solution which would solve the same problem:

zhehao-grafana commented 3 weeks ago

@Logiraptor, do you have an estimate of the effort needed to make this API happen? If customers want to use it for DPM debugging purposes, I would imagine they would prefer doing it on the front end.