Closed mjhanninen closed 7 years ago
Thanks!
The gap is a very small centroid (if you zoom the svg, there is a small bump)
The reason is that when the new observation is inserted somewhere, there are two closest centroids, to the left and to the right.
What we should do is to insert as much as possible to the one which is closer, then insert to the second one, and if there's some weight left to insert, create a new centroid.
For some reason we don't insert into nodes close to the top of the tree:
Node (1,-0.24843016487650285,5104.125,5104.125)
Nil
Node (147,-0.17098052353882248,1.052303888419273,100000.0)
Nil
Node (1,-0.15809838942891036,2785.9707730346577,2785.9707730346577)
for ./tdigest-simple -m tdigest -d standard -s 100000 -c 10 -o output.svg -i 3
(comment out "Debug output" in the bench/Simple.hs
).
And it was stupid mistake: https://github.com/futurice/haskell-tdigest/commit/996d790bbb4243522aaf3702ac01bf2be98397d7
With reference to the gap(s) in the histogram plot I wrote a quick check to see if this could be explained by a bug in the histogram data structure. But, alas, the property tests didn't unearth the bug.
Anyway here's the test.