Closed DanSimon closed 7 years ago
Merging #509 into develop will increase coverage by
<.01%
. The diff coverage is100%
.
@@ Coverage Diff @@
## develop #509 +/- ##
===========================================
+ Coverage 84.74% 84.75% +<.01%
===========================================
Files 98 98
Lines 4288 4290 +2
Branches 348 350 +2
===========================================
+ Hits 3634 3636 +2
Misses 654 654
Impacted Files | Coverage Δ | |
---|---|---|
.../scala/colossus/metrics/collectors/Histogram.scala | 91.91% <100%> (+0.12%) |
:arrow_up: |
Continue to review full report at Codecov.
Legend - Click here to learn more
Δ = absolute <relative> (impact)
,ø = not affected
,? = missing data
Powered by Codecov. Last update 2d3610e...a21ce41. Read the comment docs.
Is this related to https://github.com/tumblr/colossus/pull/503 ?
@benblack86 no, that was due to the approximation algorithm we are using
Is it not possible to take a copy of the values before calculating the percentiles, that way you know they can't change?
@benblack86 that would actually end up leading to the same race condition, since the same off-by-one can occur while generating the snapshot. Another option is to put all of the atomic longs inside a class behind an atomic reference, so that we swap out all the counters at once, however I decided against that for now as it would double the number of atomic operations needed to add each value.
I can investigate that possibility further though and try to get some measurements,
Yeah, I'm just thinking it would be better to have a complete solution where the count is also correct so anyone looking really hard at the numbers wouldn't find the error. We can go with this as a solution, but you should probably add to the comment that this isn't a complete solution.
@DanSimon you happy for me to merge this?
Yeah it should be good to go.
What's the Bug?
Very rarely, a race condition can occur that will cause calculated tail percentiles of a histogram (> 99th percentile) to report
Int.MaxValue
.The bug is due to a race condition that can happen when a value is added to the histogram at the same time the percentiles are being calculated in another thread. What ends up happening is that the calculated number of added values is 1 higher than the actual number of added values in the histogram's buckets. At high percentiles, it ends up searching through all the buckets for the missing value, and stops at the end, resulting in the "infinity" value of
Int.MaxValue
.How did I fix it?
The percentile calculation function will now keep track of the last bucket with a non-zero count, and use that bucket instead of the last one when such a situation occurs. This also ends up slightly changing how percentiles are calculated, but this change is only apparent when relatively few values have been added to the histogram.
This does not resolve the fact that the
count
is still one off from the actual total count, but generally that will have little to no effect, especially when hundreds to thousands of values are added per second. Fixing that race condition would require doing some kind of atomic locking, which is probably not worth it.How is it tested?
I added test that successfully and consistently reproduced the race condition.
Before merging, I am going to do a bit more testing to make sure that calculated percentiles are not significantly changed.