metrics-rs / metrics

A metrics ecosystem for Rust.
MIT License
1.06k stars 145 forks source link

feat: add capability to purge old histogram data #451

Closed mnpw closed 4 months ago

mnpw commented 5 months ago

What

Implements third way as prescribed here to purge old histogram data:

  • update the builder to generate a future which both drives the Hyper server future as well as a call to get_recent_metrics on an interval

Fixes https://github.com/metrics-rs/metrics/issues/245

tobz commented 4 months ago

As I read back through what I originally wrote... I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

I think what we'd actually have to do here is split the logic out of get_recent_metrics that handles draining the histograms. Essentially, take this code and stick it into a new method on the struct -- let's call it run_upkeep for this example -- and trim it down so it only iterates through each histogram, finds/creates the entry in self.distributions, and then drains samples into the entry. After that, we'd add a call to that method right before this block, and in that block we'd remove the logic where it drains the histograms.

Does that make sense?

mnpw commented 4 months ago

Thanks for the update!

I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so then I think eagerly calling this method would be fine. It would be helpful to understand your concern with an example.

tobz commented 4 months ago

Thanks for the update!

I think a major flaw with my proposal to do this by periodically calling render is that we would actually also handle things like expiration, which could lead to subsequent scrapes/pushes missing data.

Does this refer to the self.recency.should_store_* call where we check the validity of a metric? If so then I think eagerly calling this method would be fine. It would be helpful to understand your concern with an example.

(Sorry for the delayed response here.)

Yes, you're right that this would be the crux of the concern.

The example would be something like:

Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A, which if we were only rendering during scrapes, would have been included in the second scrape.

mnpw commented 4 months ago

Despite metric A being updated between the two scrapes at t=0 and t=30, we've actually missed the update to metric A, which if we were only rendering during scrapes, would have been included in the second scrape.

Ah this makes sense once I checked documentation for PrometheusBuilder::idle_timeout:

This behavior is driven by requests to generate rendered output, and so metrics will not be removed unless a request has been made recently enough to prune the idle metrics.

I have separated out the purge action to only drain the histograms now.