twitter / algebird

Abstract Algebra for Scala
https://twitter.github.io/algebird
Apache License 2.0
2.29k stars 346 forks source link

implement count-mean-min-sketch #559

Open sritchie opened 7 years ago

sritchie commented 7 years ago

The wiki page for CMS describes a variant with a different query function called count-mean-min-sketch; this tries to reduce bias in the estimate by subtracting out the bias when reporting.

The big change here is to return the median value instead of the min on query.

This blog post claims that count-mean-min-sketch will provide better estimates in low to moderately skewed data, where CMS becomes less accurate.

Here's how the CMS performs on moderately skewed data: image

Here's how count-mean-min-sketch does, estimating the same moderately-skewed distribution: image

This paper's worth looking at as well:

image (EDIT: image link is now busted. This was a screenshot of the paper section describing count-mean-min-sketch.)

The paper also discusses how to combine this idea with #460 to get a count-mean-min-sketch with conservative update.

gamebusterz commented 7 years ago

Hi, I'd like to take this up, can you please point me in the right direction. What I understand from the issue is that 'return the median value instead of the min on query.' and use this paper . Can you be a little more descriptive ? Also, which part of the code should I be looking at ?