umbrant / QuantileEstimation

Streaming estimation of percentiles, especially high percentiles.
Apache License 2.0
63 stars 7 forks source link

Fixed size of GK Algorithm? #1

Open maddenpj opened 11 years ago

maddenpj commented 11 years ago

First off a quick thanks for providing this repo, it helped me to understand this algorithm.

I'm curious as to if you are still planning on working on it, specifically finishing the GK algorithm. The paper describes a dynamic tree structure for the summary data structure as opposed to the fixed sized list. I was really hoping to see an implementation of that. Do you think you will include it in your repo or do you know of any examples that take this approach?

Thanks!

umbrant commented 11 years ago

Sure, glad that it's of use. I skipped finishing the GK algorithm since CKMS is supposed to be similar but better, though there are still some unresolved issues with the CKMS implementation as it is. I only did this for later inclusion into Hadoop so I don't have any plans to work on it further, but pull requests welcome!

maddenpj commented 11 years ago

Glad to hear back! I have only read about CKMS and not the paper but I was under the impression it was biased for higher quantiles whereas I was looking for a more general solution(?). I am actually tackling this problem (arbitrary quantiles on arbitrary streams) for work and have become very interested in it but feel a bit out of my league.

I was thinking of playing around with GK, CKMS and Q-Digest to get a better feel of what would fit my needs, do you know of any others? I don't want to invest my time implementing/benchmarking/etc if I've missed the most important one! Is there any libraries for this? What does Hadoop use?

Thanks!

umbrant commented 11 years ago

CKMS is a generalization of GK, it devolves into GK if you don't ask for high quantiles. I think that Mahout or the Google ML library might have a GK implementation if you're really interested. If you don't need strict error bounds or high quantiles, sampling is a pretty effective technique.

There's also a version for sliding windows over streams, which is essentially a bunch of windowed GK's:

http://infolab.stanford.edu/~manku/papers/04pods-sliding.pdf

On Fri, Jan 4, 2013 at 8:20 AM, Patrick Madden notifications@github.comwrote:

Glad to hear back! I have only read about CKMS and not the paper but I was under the impression it was biased for higher quantiles whereas I was looking for a more general solution(?). I am actually tackling this problem (arbitrary quantiles on arbitrary streams) for work and have become very interested in it but feel a bit out of my league.

I was thinking of playing around with GK, CKMS and Q-Digest to get a better feel of what would fit my needs, do you know of any others? I don't want to invest my time implementing/benchmarking/etc if I've missed the most important one! Is there any libraries for this? What does Hadoop use?

Thanks!

— Reply to this email directly or view it on GitHubhttps://github.com/umbrant/QuantileEstimation/issues/1#issuecomment-11888626.

dgryski commented 11 years ago

There's also http://www.cs.unc.edu/~zhangq/PUBLICATIONS/SSDBM07-Qi%20Zhang-FastQStream.pdf , "A Fast Algorithm for Approximate Quantiles in High Speed Data Streams" by Qi Zhang and Wei Wang, better description available at http://www.mathcs.emory.edu/~cheung/Courses/584-StreamDB/Syllabus/08-Quantile/Zhang.html . It seems faster to compute than GK, not sure how it deals with higher percentiles though. I've only seen references to two implementations, though, both in closed-source numerical libraries (one of which is Intel's). Not sure what that says about the algorithm though.