Closed danielnelson closed 2 years ago
In a staging environment I have ran into this error:
panicked at ‘index out of bounds: the len is 10939 but the index is 10939’, /usr/local/cargo/registry/src/github.com-1ecc6299db9ec823/sketches-ddsketch-0.1.3/src/store.rs:155:22
I'll try to find a simple way to reproduce.
First off, thank you for submitting a solution for this problem. :)
I'll need to see what the benchmarks look like, but the tracking of a timestamp per recorded value in a histogram is a bit of a red flag to me, performance-wise.
There's some thoughts I have in mind for how to avoid it while still providing proper time bucketing/rolling, but I'll wait to see what the benchmarks look like first before getting into any refactoring.
Alright, apologies for the delay.
After looking at the change, and running metrics-benchmark
this adds more overhead than I'm comfortable with. Throughput seems to drop by around ~15%, which means it's actually higher because the benchmark app basically does a loop of "counter, gauge, histogram", so if 1/3" of it is contributing an overall ~15% slowdown.... but anyways, this is also in conjunction with the fact it's forcing everyone else using Handle
to have to upgrade in the future and pay that cost, even if they don't need to.
I think what I'd want to see is localizing the change to metrics-exporter-prometheus
. This should be doable without out much additional effort by creating a new Storage
implementation in metrics-exporter-prometheus
that emulates what you did to AtomicStorage<K>
. You'd similarly have to create something like a newtype wrapper around AtomicBucket<(f64, Instant)>
and impl HistogramFn
.. but it should be doable.
As I try and get closer and closer to what a 1.0 release might look like, I'm striving to avoid continually changing every crate in the workspace due to changes that primarily occur in only one of them.
That makes sense, I will try to make the suggested changes this week.
I have updated the PR as suggested, please let me know what you think. BTW, the issue linked above with ddsketch is not yet resolved and will need to be before this could be merged.
Taking a cursory look, this a lot more in line with what I was thinking. 👍🏻
There's probably some performance to be squeezed out if we lowered the timing accuracy by bucketing with a background task on an interval, rather than getting the timestamp at record time, but I don't think this matters a ton for users using the Prometheus exporter anyways.
I'll do a more in-depth path, including looking at the DDSketch issue you mentioned, in the next few days. Thank you for making the requested changes.
Bah, looks like indexmap
is unhappy on Rust 1.55. I'll get the MSRV bumped to 1.56 which should fix that and then we can see what CI looks like for this.
The ddsketch issue has been fixed in a new 0.2.0 release. With it we can simplify the Summary
, we no longer need the negative sketch or zero handling since it is part of the DDSketch
type. I'll update this PR to contain the changes.
I updated the CI workflow to bump the MSRV to 1.56.1 which seems to get the MSRV-specific unit tests passing again, so I think you should be good if you merge in main
.
Thanks for your contribution! <3
Released as metrics-exporter-prometheus@v0.11.0
.
This PR attempts to address the issue described in #269 where Prometheus Summaries become saturated. I added a RollingSummary type that manages a list of Summary so that old values can be retired.
There is a breaking change in
AtomicStorage
:I tested using a program that produces a test pattern, every two minutes it switches between uniform random values between 0-100 and 100-200.
I also graph the summary
rate(sum[1m]) / rate(count[1m])
for comparison to the 50th quantile.Before this patch:
With this patch: