Open sritchie opened 8 years ago
Hi, I'd like to take this up, can you please point me in the right direction. What I understand from the issue is that 'return the median value instead of the min on query.' and use this paper . Can you be a little more descriptive ? Also, which part of the code should I be looking at ?
The wiki page for CMS describes a variant with a different query function called count-mean-min-sketch; this tries to reduce bias in the estimate by subtracting out the bias when reporting.
The big change here is to return the median value instead of the min on query.
This blog post claims that count-mean-min-sketch will provide better estimates in low to moderately skewed data, where CMS becomes less accurate.
Here's how the CMS performs on moderately skewed data:
Here's how count-mean-min-sketch does, estimating the same moderately-skewed distribution:
This paper's worth looking at as well:
(EDIT: image link is now busted. This was a screenshot of the paper section describing count-mean-min-sketch.)
The paper also discusses how to combine this idea with #460 to get a count-mean-min-sketch with conservative update.