metrics-rs / metrics

A metrics ecosystem for Rust.
MIT License
1.13k stars 158 forks source link

Add RollingSummary to prevent summary saturation #306

Closed danielnelson closed 2 years ago

danielnelson commented 2 years ago

This PR attempts to address the issue described in #269 where Prometheus Summaries become saturated. I added a RollingSummary type that manages a list of Summary so that old values can be retired.

There is a breaking change in AtomicStorage:

- type Histogram = Arc<AtomicBucket<f64>>;
+ type Histogram = Arc<AtomicBucket<(f64, Instant)>>;

I tested using a program that produces a test pattern, every two minutes it switches between uniform random values between 0-100 and 100-200.

I also graph the summary rate(sum[1m]) / rate(count[1m]) for comparison to the 50th quantile.

Before this patch:

2022-06-09-163831_1347x637_scrot

With this patch:

2022-06-09-160026_1345x635_scrot

danielnelson commented 2 years ago

In a staging environment I have ran into this error:

panicked at ‘index out of bounds: the len is 10939 but the index is 10939’, /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sketches-ddsketch-0.1.3/src/store.rs:155:22

I'll try to find a simple way to reproduce.

tobz commented 2 years ago

First off, thank you for submitting a solution for this problem. :)

I'll need to see what the benchmarks look like, but the tracking of a timestamp per recorded value in a histogram is a bit of a red flag to me, performance-wise.

There's some thoughts I have in mind for how to avoid it while still providing proper time bucketing/rolling, but I'll wait to see what the benchmarks look like first before getting into any refactoring.

tobz commented 2 years ago

Alright, apologies for the delay.

After looking at the change, and running metrics-benchmark this adds more overhead than I'm comfortable with. Throughput seems to drop by around ~15%, which means it's actually higher because the benchmark app basically does a loop of "counter, gauge, histogram", so if 1/3" of it is contributing an overall ~15% slowdown.... but anyways, this is also in conjunction with the fact it's forcing everyone else using Handle to have to upgrade in the future and pay that cost, even if they don't need to.

I think what I'd want to see is localizing the change to metrics-exporter-prometheus. This should be doable without out much additional effort by creating a new Storage implementation in metrics-exporter-prometheus that emulates what you did to AtomicStorage<K>. You'd similarly have to create something like a newtype wrapper around AtomicBucket<(f64, Instant)> and impl HistogramFn.. but it should be doable.

As I try and get closer and closer to what a 1.0 release might look like, I'm striving to avoid continually changing every crate in the workspace due to changes that primarily occur in only one of them.

danielnelson commented 2 years ago

That makes sense, I will try to make the suggested changes this week.

danielnelson commented 2 years ago

I have updated the PR as suggested, please let me know what you think. BTW, the issue linked above with ddsketch is not yet resolved and will need to be before this could be merged.

tobz commented 2 years ago

Taking a cursory look, this a lot more in line with what I was thinking. 👍🏻

There's probably some performance to be squeezed out if we lowered the timing accuracy by bucketing with a background task on an interval, rather than getting the timestamp at record time, but I don't think this matters a ton for users using the Prometheus exporter anyways.

I'll do a more in-depth path, including looking at the DDSketch issue you mentioned, in the next few days. Thank you for making the requested changes.

tobz commented 2 years ago

Bah, looks like indexmap is unhappy on Rust 1.55. I'll get the MSRV bumped to 1.56 which should fix that and then we can see what CI looks like for this.

danielnelson commented 2 years ago

The ddsketch issue has been fixed in a new 0.2.0 release. With it we can simplify the Summary, we no longer need the negative sketch or zero handling since it is part of the DDSketch type. I'll update this PR to contain the changes.

tobz commented 2 years ago

I updated the CI workflow to bump the MSRV to 1.56.1 which seems to get the MSRV-specific unit tests passing again, so I think you should be good if you merge in main.

tobz commented 2 years ago

Thanks for your contribution! <3

tobz commented 2 years ago

Released as metrics-exporter-prometheus@v0.11.0.