umbrant / QuantileEstimation

Streaming estimation of percentiles, especially high percentiles.
Apache License 2.0

The CKMS doesn't maintain the quantiles summary correctly #2

Open rafis1 opened 9 years ago

rafis1 commented 9 years ago

The CKMS has been very helpful to me. After digging into the details I found that it doesn't follow the paper correctly, resulting in summaries that are much larger than required. See details in the HBase JIRA: https://issues.apache.org/jira/browse/HBASE-14324

vlsi commented 1 year ago

Just for reference, the paper itself is buggy.

Here's my assessment (I contacted Graham Cormode in 2012),


from: Vladimir Sitnikov

It looks like the proof of Theorem 3 for case (ii) is not correct.

It reads as follows (I omit the _i subscripts and use q instead of \phi):

(1-q)r + 2en - 2er > (q + e - q*q - e*q)n

Then it somehow results in

(1-q)r - 2er > (1-q)(q-e)n = (q - e - q*q +(!) e*q)n

However, if one carefully moves 2en to the right side, it results in

(1-q)r - 2er > (q - e - q*q -(!) e*q)n

Note the correct sign before e*q is "-", not "+".
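Restating the algebra step by step (this just spells out the move described above, writing \varepsilon for e):

```latex
% Proof of Theorem 3, case (ii), starts from
(1-q)r + 2\varepsilon n - 2\varepsilon r \;>\; (q + \varepsilon - q^2 - \varepsilon q)\,n
% Subtracting 2\varepsilon n from both sides gives
(1-q)r - 2\varepsilon r \;>\; (q + \varepsilon - q^2 - \varepsilon q - 2\varepsilon)\,n
                        \;=\; (q - \varepsilon - q^2 - \varepsilon q)\,n
% whereas the paper writes the right-hand side as
(1-q)(q-\varepsilon)\,n \;=\; (q - \varepsilon - q^2 + \varepsilon q)\,n
% The two differ in the sign of the \varepsilon q term.
```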

The test case that gives an error exceeding the "e" is as follows:

1) Prepare for computation of a single quantile, q=0.9, e=0.04.
2) Insert the values 0, 9999, then 1..9998. Compress is omitted for simplification. This order ensures the minimum and maximum are initialized at the very beginning, so all further items get a non-zero delta = f(r,n) - 1.
3) Run query(0.9). It somehow results in 7000, which is the 0.7 quantile instead of the desired 0.9 +- 0.04.

[1] Cormode, Korn, Muthukrishnan, Srivastava. Effective Computation of Biased Quantiles over Data Streams. ICDE 2005.


from: Vladimir Sitnikov

For instance, suppose we have 10'000 items in the list (we have just added 10K elements and never run compress). If the next item to be inserted has a rank of 7000, the Delta would be 2e(n-r)/(1-q) = 2*0.04*(10000-7000)/(1-0.9) = 2400. That means r_i + g_i + Delta_i = 7000 + 1 + 2400 = 9401, which is near enough to convince the algorithm to use that 7000th item as the 0.9 quantile, even though we desire just a 4% rank error (that is, ranks 9000 +- 400 for the 0.9 quantile).
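The arithmetic above can be checked with a small sketch of the targeted-quantiles invariant from Definition 5 of the paper. The class and method names here are illustrative and not taken from any particular implementation:

```java
// A small check of the numbers above, using the targeted-quantiles invariant
// f(r, n) from Definition 5 of the CKMS paper for a single target (phi, eps).
public class TargetedInvariantDemo {
    /** Allowed rank uncertainty f(r, n) for a single target (phi, eps). */
    static double f(double r, long n, double phi, double eps) {
        if (r <= phi * n) {
            return 2 * eps * (n - r) / (1 - phi);
        }
        return 2 * eps * r / phi;
    }

    public static void main(String[] args) {
        double phi = 0.9, eps = 0.04;
        long n = 10_000;                 // items inserted, compress never run
        long r = 7_000;                  // rank of the next item to insert

        // 2 * 0.04 * (10000 - 7000) / (1 - 0.9) = 2400
        double delta = f(r, n, phi, eps);
        System.out.println("Delta = " + delta);
        // r + g + Delta = 7000 + 1 + 2400 = 9401, near enough to rank
        // phi * n = 9000 for the query to accept the 7000th item as the
        // 0.9-quantile, despite the desired ranks being 9000 +- 400.
        System.out.println("r + g + Delta = " + (r + 1 + delta));
    }
}
```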

Here is my implementation of CKMS: https://github.com/vlsi/QuantileEstimation. When I configure the estimator to compute the 0.9 quantile with 0.04 error, it returns 7799, which is clearly off. I am not sure my implementation is 100% correct, but I can find no better one (I had to fork Andrew's project and fix a horde of bugs).

Here is the line produced by QuantileEstimationCKMS:

real Q(0.900) = 8999, estimated Q(0.900) = 7799 == real Q(0.780) {off by -0.120, eps=0.040} (13.3% error)

That means the following: the real 0.9-quantile is 8999; the estimation returned 7799, which is the 0.78 quantile. 0.78 - 0.9 = -0.12, yet the error should not exceed 0.04.

The relative error of the estimation is 13.3% = (8999-7799)/8999. I understand this has nothing to do with eps in the algorithm, but I use it to gauge the actual error of the estimator.


From: Graham

Thanks for this example, I see the problem now. The invariant set up for Targeted Quantiles in Defn 5 in the paper is problematic, in that it lets the allowed error grow too quickly, leading to the kind of contradictions that you identify. To fix this, a stronger invariant may be required, one that relaxes the error more slowly. For example, we could set

f_j(r_i, n) = eps_j ( n + | r_i - phi_j n |)

However, this is too conservative, since it gives large error across a broad range.
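As a sketch, the proposed invariant can be evaluated at a few ranks to see the behavior (class and method names are illustrative): the allowed uncertainty is eps*n at the target rank and grows linearly away from it, so it is at least eps*n everywhere, which is the "large error across a broad range" noted above.

```java
// Sketch of the stronger invariant proposed above:
// f_j(r_i, n) = eps_j * (n + |r_i - phi_j * n|)
public class StrongInvariantDemo {
    static double f(double r, long n, double phi, double eps) {
        return eps * (n + Math.abs(r - phi * n));
    }

    public static void main(String[] args) {
        double phi = 0.9, eps = 0.04;
        long n = 10_000;
        // At the target rank phi * n = 9000 this is eps * n = 400;
        // at every other rank it is strictly larger.
        for (long r = 0; r <= n; r += 2_500) {
            System.out.println("f(" + r + ") = " + f(r, n, phi, eps));
        }
    }
}
```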

An alternate solution might be to focus on the main biased quantiles approach: given a target (eps, phi), answer it using the biased quantiles algorithm with (eps/phi) [for phi < 1/2] or (eps/(1-phi)) [for phi > 1/2]. This will work for the set of targets (0.9, 0.95, 0.99) that you mentioned in your initial message.
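The mapping from a target (phi, eps) to the error parameter handed to the biased-quantiles algorithm can be sketched as follows. Helper names are illustrative; the low-biased rank-error guarantee eps'*r and the high-biased guarantee eps'*(n-r) are assumptions taken from the biased-quantiles setting, not from any particular implementation:

```java
// Sketch of the workaround described above: answer a targeted query (phi, eps)
// with the biased-quantiles algorithm, scaling its error parameter so the rank
// error at rank phi * n stays within eps * n.
public class BiasedFallback {
    /** Error parameter to hand to the biased-quantiles algorithm. */
    static double biasedEps(double phi, double eps) {
        // Low-biased:  eps' * phi * n       <= eps * n  =>  eps' = eps / phi.
        // High-biased: eps' * (1 - phi) * n <= eps * n  =>  eps' = eps / (1 - phi).
        return phi < 0.5 ? eps / phi : eps / (1 - phi);
    }

    public static void main(String[] args) {
        // For the 0.9 target from this thread: 0.04 / (1 - 0.9) = 0.4.
        System.out.println(biasedEps(0.90, 0.04));
        System.out.println(biasedEps(0.99, 0.001));
    }
}
```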

vlsi commented 1 year ago

Here's the commit that brings the implementation in line with the original paper: https://github.com/vlsi/QuantileEstimation/commit/20f5d9677fcd97fdab1f7ba090cba91d6f4c42ee. However, the theorem in the paper is invalid, so even the "improved implementation" produces very high errors.