postmates / quantiles

Approximations to Histograms
MIT License

Error bound check for CKMS appears to be off #27

Open c0deaddict opened 4 years ago

c0deaddict commented 4 years ago

I'm trying to make Elixir bindings for this library. While testing the bindings I noticed that quantile queries were off by more than the error bound (compared to quantiles computed from a sorted list). Reading through the tests in this project, I noticed that the error bound check does not use .abs():

https://github.com/postmates/quantiles/blob/04b0351abdede28afe7d89433505e6df8abb926f/src/ckms/mod.rs#L376

The test unfortunately fails when the mentioned line is changed to (v - percentile(&data, prcnt)).abs() < err.

Is this the paper on which the algorithm is based? http://www.dimacs.rutgers.edu/~graham/pubs/papers/bq-pods.pdf I tried to read through it to discover what the error bound is, but could not find it.

blt commented 4 years ago

Is it possible for you to share your test code, or at least the data you're using?

Regarding the paper: that's the one, per this discussion, but back when I originally implemented this library the version available freely online had a bug in its algorithm. I know the IEEE version is accurate, per discussion with Graham Cormode.

If you can share your test data I'll try and reproduce the issue in this library's test code.

c0deaddict commented 4 years ago

My Elixir code is a mess at the moment, but I wrote a small test in Rust that shows what I mean. The program below works fine for n = 100 or n = 1000, but fails for n = 10000 or larger. It does not seem to depend on the random seed.

What I'm unsure about is how to correctly compute the exact quantile: should the index be computed with ceil, floor, or round?

use quantiles::ckms::CKMS;
use rand::Rng;
use std::cmp;

fn main() {
    let error = 0.001;
    let n = 10000;

    let mut data = Vec::<u16>::new();
    let mut ckms = CKMS::<u16>::new(error);

    for _ in 0..n {
        let v = rand::thread_rng().gen_range(u16::MIN, u16::MAX);
        data.push(v);
        ckms.insert(v);
    }

    data.sort();

    for p in 0..1001 {
        let q = p as f64 / 1000.0;
        let idx = cmp::max(1, (q * n as f64).ceil() as usize);
        let expected = data[idx - 1];
        let (_rank, actual) = ckms.query(q).unwrap();
        let diff = expected as f64 - actual as f64;
        if diff.abs() > error {
            println!(
                "q {}: expected {} actual {} diff {} > {}",
                q, expected, actual, diff, error
            );
        }
    }
}
blt commented 4 years ago

Excellent. Thank you. I'm going to convert your code here into a test in the suite. Will report back.

blt commented 4 years ago

@c0deaddict I think the error is in your comparison code. I put together a branch that includes your test code and trims it down some. This is it: https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L293-L366

I first wondered if the reported issue could be triggered without resorting to random data, and it could. I found somewhat accidentally that the issue could be triggered by a drastically smaller number of points, 10 in the above snippet. I went back to the paper and for the data set 0 .. 10 calculated the rank and approximation the paper suggests. That's what these lines are: https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L320-L338

If you run the code this portion of the test will pass. Good sign. At least my understanding of the paper and the implementation correspond. The remaining portion of the test that fails is: https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L340-L365

This fails like so:

---- ckms::test::test_insert stdout ----
thread 'ckms::test::test_insert' panicked at 'assertion failed: `(left != right)`
  left: `Some(Greater)`,
 right: `Some(Greater)`: total_points: 10 | idx: 2 | (rank, actual): (1, 0) | quantile: 0.101 | 1 > 0.001', src/ckms/mod.rs:353:13
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

failures:
    ckms::test::test_insert

The important thing to note here is that the idx disagrees with the rank. If you adjust from ceil to floor then the test will pass, but only incidentally. Quantile calculation like this requires you to take into account whether the dataset has an odd or even number of members, which the calculation in your example doesn't do. If you adjust to 11 total samples -- and comment out the hand-calculation bit -- then you'll find the test will pass. I believe what this means is that the calculation you've supplied is faulty. I feel pretty confident about that after having gone back over the paper, plus remembering that the tests

https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L419-L448

and

https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L457-L499

both assert that the error guarantee holds, for randomly generated datasets and queries. If you buy this analysis I think what today's debugging session suggests to me is:

  1. the fact that the paper laying out the data structure is behind a paywall is a severe impediment to trust in the structure, and
  2. this library would be greatly improved by a reference quantile structure that consumed arbitrarily large space but returned exact results for queries.
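For what it's worth, checking the guarantee by rank rather than by value index could look something like the sketch below. This is not the library's actual test code; `rank_within_bound` is a made-up helper illustrating the idea that the CKMS guarantee constrains the *rank* of the returned value, not its value.

```rust
// Sketch: verify an approximate quantile answer by rank, not by value.
// The CKMS guarantee is on ranks: the returned value should occupy a rank
// within err * n of the target rank q * n in the sorted data.
fn rank_within_bound(sorted: &[u16], q: f64, err: f64, answer: u16) -> bool {
    let n = sorted.len() as f64;
    // 1-based ranks of the first and last position `answer` could occupy
    let lo = sorted.partition_point(|&x| x < answer) as f64 + 1.0;
    let hi = sorted.partition_point(|&x| x <= answer) as f64;
    let target = q * n;
    // accept if some rank occupied by `answer` is within err * n of the target
    lo - err * n <= target && target <= hi + err * n
}

fn main() {
    let data: Vec<u16> = (0..10).collect(); // 0, 1, ..., 9, already sorted

    // q = 0.1 over 10 points targets rank 1.0; value 0 occupies rank 1.
    assert!(rank_within_bound(&data, 0.1, 0.001, 0));

    // The failing case above: q = 0.101 targets rank 1.01, but the value at
    // idx = ceil(q * n) = 2, namely 1, occupies rank 2 -- outside the bound,
    // so comparing against data[idx - 1] rejects an admissible answer.
    assert!(!rank_within_bound(&data, 0.101, 0.001, 1));

    // q = 0.5 targets rank 5.0; value 4 occupies rank 5.
    assert!(rank_within_bound(&data, 0.5, 0.001, 4));
}
```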
c0deaddict commented 4 years ago

@blt First of all thanks for your thorough investigation, and apologies for my late reply.

> @c0deaddict I think the error is in your comparison code. I put together a branch that includes your test code and trims it down some. This is it:
>
> https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L293-L366
>
> I first wondered if the reported issue could be triggered without resorting to random data, and it could. I found somewhat accidentally that the issue could be triggered by a drastically smaller number of points, 10 in the above snippet. I went back to the paper and for the data set 0 .. 10 calculated the rank and approximation the paper suggests. That's what these lines are:
>
> https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L320-L338
>
> If you run the code this portion of the test will pass. Good sign. At least my understanding of the paper and the implementation correspond. The remaining portion of the test that fails is:
>
> https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L340-L365
>
> This fails like so:
>
> ---- ckms::test::test_insert stdout ----
> thread 'ckms::test::test_insert' panicked at 'assertion failed: `(left != right)`
>   left: `Some(Greater)`,
>  right: `Some(Greater)`: total_points: 10 | idx: 2 | (rank, actual): (1, 0) | quantile: 0.101 | 1 > 0.001', src/ckms/mod.rs:353:13
> note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
>
> failures:
>     ckms::test::test_insert
>
> The important thing to note here is that the idx disagrees with the rank. If you adjust from ceil to floor then the test will pass, but only incidentally. Quantile calculation like this requires you to take into account whether the dataset has an odd or even number of members, which the calculation in your example doesn't do. If you adjust to 11 total samples -- and comment out the hand-calculation bit -- then you'll find the test will pass. I believe what this means is that the calculation you've supplied is faulty.

How should the odd/even number of members be taken into account? Using ceil or floor for one or the other doesn't give the right results either. I get the best results by using round.

There seem to be different methods of calculating percentiles. Numpy, for example, can do linear interpolation when the quantile does not fall on an exact array index: https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L3865-L3969 I think the main question is: which method of calculating exact quantiles did the authors of the paper use?
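To make the ambiguity concrete, here is a sketch of two common conventions side by side. The helper names are ours, not from this library or numpy; the linear version mirrors what numpy's default interpolation does.

```rust
// Two conventions for the "exact" q-quantile of sorted data.

// Nearest-rank: take the element at the ceil(q * n)-th position (1-based).
fn percentile_nearest_rank(sorted: &[f64], q: f64) -> f64 {
    let n = sorted.len();
    let idx = ((q * n as f64).ceil() as usize).max(1).min(n);
    sorted[idx - 1]
}

// Linear interpolation between the two neighbouring elements when q does
// not land on an exact array index.
fn percentile_linear(sorted: &[f64], q: f64) -> f64 {
    let pos = q * (sorted.len() - 1) as f64; // fractional index into the array
    let lo = pos.floor() as usize;
    let hi = pos.ceil() as usize;
    let frac = pos - lo as f64;
    sorted[lo] + (sorted[hi] - sorted[lo]) * frac
}

fn main() {
    let data: Vec<f64> = (0..10).map(|x| x as f64).collect();
    // The two conventions disagree on small datasets:
    println!("{}", percentile_nearest_rank(&data, 0.5)); // prints 4
    println!("{}", percentile_linear(&data, 0.5));       // prints 4.5
}
```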

> I feel pretty confident about that after having gone back over the paper, plus remembering that the tests
>
> https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L419-L448
>
> and
>
> https://github.com/postmates/quantiles/blob/cde2a223afcf9f3f17c6f1e020ac55ff11ebe1dc/src/ckms/mod.rs#L457-L499
>
> both assert that the error guarantee holds, for randomly generated datasets and queries.

These tests look pretty solid. The only thing that seems strange is that the assertion checks that the actual value v is not larger than the expected value percentile(&data, prcnt) by more than err, but v could also be smaller than percentile(&data, prcnt) by more than err. For example, if v = 80 and percentile(&data, prcnt) = 100, that would slip through, but it is an error, right?
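A tiny illustration of the difference (the numbers are made up; these helpers just mirror the shape of the test's comparison):

```rust
// One-sided check, as in the current test: only catches overshoot.
fn one_sided(v: f64, expected: f64, err: f64) -> bool {
    v - expected < err
}

// Symmetric check: catches both overshoot and undershoot.
fn two_sided(v: f64, expected: f64, err: f64) -> bool {
    (v - expected).abs() < err
}

fn main() {
    // v = 80 undershoots expected = 100 by 20, far beyond err = 0.001,
    // yet the one-sided form accepts it.
    assert!(one_sided(80.0, 100.0, 0.001));  // slips through
    assert!(!two_sided(80.0, 100.0, 0.001)); // correctly flagged
}
```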

> If you buy this analysis I think what today's debugging session suggests to me is:
>
> 1. That the paper that lays out the data structure is behind a paywall is a severe impediment to trust of the structure and

Yes indeed. Is this the paper you based it on? https://ieeexplore.ieee.org/abstract/document/4274974/ Maybe it would be a good idea to include a link to it in the source code.

Fortunately, there are ways to work around the paywall :-)

> 2. this library would be greatly improved with a reference quantile structure that consumed arbitrarily large space but returned exact results for queries.

That would be very useful!

blt commented 4 years ago

> @blt First of all thanks for your thorough investigation, and apologies for my late reply.

Not a problem! Things take the time they take.

> How should the odd/even number of members be taken into account? Using ceil or floor for one or the other doesn't give the right results either. I get the best results by using round.
>
> There seem to be different methods of calculating percentiles. Numpy, for example, can do linear interpolation when the quantile does not fall on an exact array index: https://github.com/numpy/numpy/blob/master/numpy/lib/function_base.py#L3865-L3969 I think the main question is: which method of calculating exact quantiles did the authors of the paper use?

This is a very good question, and not necessarily something I appreciated when I first wrote this library. I won't likely have a chance to get to the paper until at least next weekend, but this does strike me as a very important question to pursue.

> These tests look pretty solid. The only thing that seems strange is that the assertion checks that the actual value v is not larger than the expected value percentile(&data, prcnt) by more than err, but v could also be smaller than percentile(&data, prcnt) by more than err. For example, if v = 80 and percentile(&data, prcnt) = 100, that would slip through, but it is an error, right?

Ah, indeed. Yes that is an oversight. That should be corrected. I'd be curious to see what turns up when it is.

> Yes indeed. Is this the paper you based it on? https://ieeexplore.ieee.org/abstract/document/4274974/ Maybe it would be a good idea to include a link to it in the source code.

It's linked in the README but that's a little distant. What you get in the source documentation is not the best:

https://github.com/postmates/quantiles/blob/3f7244236bc6ee23ab3aae486b25c2d9d8b121a2/src/ckms/mod.rs#L6-L12

> That would be very useful!

Cool. When I have a little time next I'll fix the test bug you pointed out above and see about including this in the project. Sounds fun.
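For the record, a reference structure along those lines could be as simple as the sketch below. This is illustrative only; the type name and the nearest-rank convention are choices made here, not anything that exists in the library yet.

```rust
// Sketch of a reference quantile structure: O(n) space, exact answers.
struct ExactQuantiles {
    data: Vec<u16>,
    sorted: bool, // lazily re-sort only when a query follows an insert
}

impl ExactQuantiles {
    fn new() -> Self {
        ExactQuantiles { data: Vec::new(), sorted: true }
    }

    fn insert(&mut self, v: u16) {
        self.data.push(v);
        self.sorted = false;
    }

    /// Exact q-quantile by the nearest-rank convention; None if empty.
    fn query(&mut self, q: f64) -> Option<u16> {
        if self.data.is_empty() {
            return None;
        }
        if !self.sorted {
            self.data.sort_unstable();
            self.sorted = true;
        }
        let n = self.data.len();
        let idx = ((q * n as f64).ceil() as usize).max(1).min(n);
        Some(self.data[idx - 1])
    }
}

fn main() {
    let mut eq = ExactQuantiles::new();
    for v in [5u16, 1, 9, 3, 7] {
        eq.insert(v);
    }
    // sorted data is [1, 3, 5, 7, 9]
    println!("{:?}", eq.query(0.5)); // prints Some(5)
    println!("{:?}", eq.query(1.0)); // prints Some(9)
}
```

Plugging something like this into the property tests in place of the hand-rolled index calculation would give an unambiguous ground truth to compare the CKMS answers against.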