phadej / tdigest

On-line accumulation of rank-based statistics such as quantiles and trimmed means
30 stars 7 forks source link

Test the validity of histogram #10

Closed mjhanninen closed 7 years ago

mjhanninen commented 7 years ago

With reference to the gap(s) in the histogram plot I wrote a quick check to see if this could be explained by a bug in the histogram data structure. But, alas, the property tests didn't unearth the bug.

Anyway here's the test.

phadej commented 7 years ago

Thanks!

phadej commented 7 years ago

The gap is a very small centroid (if you zoom the svg, there is a small bump)

The reason is that when the new observation is inserted somewhere, there are two closest centroids, to the left and to the right.

What we should do is to insert as much as possible to the one which is closer, then insert to the second one, and if there's some weight left to insert, create a new centroid.

For some reason we don't insert into nodes close to the top of the tree:

               Node (1,-0.24843016487650285,5104.125,5104.125)
                  Nil
Node (147,-0.17098052353882248,1.052303888419273,100000.0)
                     Nil
                  Node (1,-0.15809838942891036,2785.9707730346577,2785.9707730346577)

for ./tdigest-simple -m tdigest -d standard -s 100000 -c 10 -o output.svg -i 3 (comment out "Debug output" in the bench/Simple.hs).

phadej commented 7 years ago

And it was stupid mistake: https://github.com/futurice/haskell-tdigest/commit/996d790bbb4243522aaf3702ac01bf2be98397d7