feat: add capability to purge old histogram data

mnpw commented 5 months ago

What

add purge_timeout option to PrometheusBuilder
run a purger that purges based on the purge_timeout

Implements third way as prescribed here to purge old histogram data:

update the builder to generate a future which both drives the Hyper server future as well as a call to get_recent_metrics on an interval

Fixes https://github.com/metrics-rs/metrics/issues/245

tobz commented 4 months ago

As I read back through what I originally wrote... I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

I think what we'd actually have to do here is split the logic out of get_recent_metrics that handles draining the histograms. Essentially, take this code and stick it into a new method on the struct -- let's call it run_upkeep for this example -- and trim it down so it only iterates through each histogram, finds/creates the entry in self.distributions, and then drains samples into the entry. After that, we'd add a call to that method right before this block, and in that block we'd remove the logic where it drains the histograms.

Does that make sense?

mnpw commented 4 months ago

Thanks for the update!

I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so then I think eagerly calling this method would be fine. It would be helpful to understand your concern with an example.

tobz commented 4 months ago

Thanks for the update!

I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so then I think eagerly calling this method would be fine. It would be helpful to understand your concern with an example.

(Sorry for the delayed response here.)

Yes, you're right that this would be the crux of the concern.

The example would be something like:

the exporter is configured with an idle timeout of 10 seconds, and a purger interval of 5s
the exporter is scraped at t=0, which includes metric A
metric A is updated at t=5 (meaning it will go idle at t=15 if not updated by then)
the purger runs at t=5 and t=10 and since nothing is idle yet, all it ends up doing is draining the histogram buckets
the purger again runs at t=15, but now, metric A is considered idle, which removes metric A from the registry
the exporter is scraper at t=30, which now no longer includes metric A

Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A, which if we were only rendering during scrapes, would have been included in the second scrape.

mnpw commented 4 months ago

Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A, which if we were only rendering during scrapes, would have been included in the second scrape.

Ah this makes sense once I checked documentation for PrometheusBuilder::idle_timeout:

This behavior is driven by requests to generate rendered output, and so metrics will not be removed unless a request has been made recently enough to prune the idle metrics.

I have separated out the purge action to only drain the histograms now.

metrics-rs / metrics

feat: add capability to purge old histogram data #451

What

run a purger that purges based on the purge_timeout