tumblr / colossus

I/O and Microservice library for Scala
Apache License 2.0
1.14k stars 96 forks source link

Fixing Histogram race condition #509

Closed DanSimon closed 7 years ago

DanSimon commented 7 years ago

What's the Bug?

Very rarely, a race condition can occur that will cause calculated tail percentiles of a histogram (> 99th percentile) to report Int.MaxValue.

The bug is due to a race condition that can happen when a value is added to the histogram at the same time the percentiles are being calculated in another thread. What ends up happening is that the calculated number of added values is 1 higher than the actual number of added values in the histogram's buckets. At high percentiles, it ends up searching through all the buckets for the missing value, and stops at the end, resulting in the "infinity" value of Int.MaxValue.

How did I fix it?

The percentile calculation function will now keep track of the last bucket with a non-zero count, and use that bucket instead of the last one when such a situation occurs. This also ends up slightly changing how percentiles are calculated, but this change is only apparent when relatively few values have been added to the histogram.

This does not resolve the fact that the count is still one off from the actual total count, but generally that will have little to no effect, especially when hundreds to thousands of values are added per second. Fixing that race condition would require doing some kind of atomic locking, which is probably not worth it.

How is it tested?

I added test that successfully and consistently reproduced the race condition.

Before merging, I am going to do a bit more testing to make sure that calculated percentiles are not significantly changed.

codecov-io commented 7 years ago

Codecov Report

Merging #509 into develop will increase coverage by <.01%. The diff coverage is 100%.

Impacted file tree graph

@@             Coverage Diff             @@
##           develop     #509      +/-   ##
===========================================
+ Coverage    84.74%   84.75%   +<.01%     
===========================================
  Files           98       98              
  Lines         4288     4290       +2     
  Branches       348      350       +2     
===========================================
+ Hits          3634     3636       +2     
  Misses         654      654
Impacted Files Coverage Δ
.../scala/colossus/metrics/collectors/Histogram.scala 91.91% <100%> (+0.12%) :arrow_up:

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 2d3610e...a21ce41. Read the comment docs.

benblack86 commented 7 years ago

Is this related to https://github.com/tumblr/colossus/pull/503 ?

amotamed commented 7 years ago

@benblack86 no, that was due to the approximation algorithm we are using

benblack86 commented 7 years ago

Is it not possible to take a copy of the values before calculating the percentiles, that way you know they can't change?

DanSimon commented 7 years ago

@benblack86 that would actually end up leading to the same race condition, since the same off-by-one can occur while generating the snapshot. Another option is to put all of the atomic longs inside a class behind an atomic reference, so that we swap out all the counters at once, however I decided against that for now as it would double the number of atomic operations needed to add each value.

I can investigate that possibility further though and try to get some measurements,

benblack86 commented 7 years ago

Yeah, I'm just thinking it would be better to have a complete solution where the count is also correct so anyone looking really hard at the numbers wouldn't find the error. We can go with this as a solution, but you should probably add to the comment that this isn't a complete solution.

benblack86 commented 7 years ago

@DanSimon you happy for me to merge this?

DanSimon commented 7 years ago

Yeah it should be good to go.