Interesting. I wonder if this depends on what exactly the t-digests look like, but it's hard to say. Can you provide a reproducer (i.e. a data set exhibiting this issue), or at least the t-digest output?
I think this query replicates the problem. There are an equal number of zeroes and values bigger than zero, so I was expecting `tdigest_percentile_of(t, 0)` to be close to 0.5 (as it is if I use `percentile_cont(0.50) within group (order by n)`). I understand that tdigest is more precise close to 0 or 1, but does that apply to the result of `tdigest_percentile_of` too?

Another thing I found strange: if I don't `order by n` in the first query, the results are different (still inaccurate, though). Should I order the data prior to using tdigest?
```sql
with data as (
    select generate_series(1, 10, 0.01) as n
    union all
    select generate_series(1, 10, 0.01) * 0 as n
    order by n
), ttable as (
    select tdigest(n, 100) as t
    from data
)
select
    tdigest(t) as tdigest,
    tdigest_percentile(t, 0.5) as pctl_50,
    tdigest_percentile(t, 0.25) as pctl_25,
    tdigest_percentile(t, 0.10) as pctl_10,
    tdigest_percentile(t, 0.05) as pctl_05,
    tdigest_percentile_of(t, 0) as pctl_of_0,
    tdigest_percentile_of(t, 1) as pctl_of_1
from
    ttable;
```
| tdigest | pctl_50 | pctl_25 | pctl_10 | pctl_05 | pctl_of_0 | pctl_of_1 |
|---------|---------|---------|---------|---------|-----------|-----------|
| (see below) | 0.794 | 0 | 0 | 0 | 0.223 | 0.512 |
The tdigest:

```
flags 0 count 1802 compression 100 centroids 44 (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 2) (0.000000, 3) (0.000000, 4) (0.000000, 5) (0.000000, 8) (0.000000, 11) (0.000000, 15) (0.000000, 21) (0.000000, 29) (0.000000, 40) (0.000000, 53) (0.000000, 68) (0.000000, 84) (0.000000, 98) (0.000000, 98) (0.000000, 107) (0.000000, 152) (172.050000, 208) (572.460000, 188) (748.890000, 157) (765.700000, 124) (675.180000, 93) (548.420000, 68) (423.850000, 49) (308.210000, 34) (224.520000, 24) (152.880000, 16) (106.590000, 11) (78.280000, 8) (49.250000, 5) (39.580000, 4) (19.850000, 2) (19.890000, 2) (9.960000, 1) (9.970000, 1) (9.980000, 1) (9.990000, 1) (10.000000, 1)
```
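For reference, here's how the exact value I compared against can be computed (a sketch using the same CTE as the query above):

```sql
-- Exact percentile over the same data, computed without tdigest.
with data as (
    select generate_series(1, 10, 0.01) as n
    union all
    select generate_series(1, 10, 0.01) * 0 as n
)
select percentile_cont(0.50) within group (order by n) as exact_pctl_50
from data;
```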
Yes - the accuracy trade-off applies to `tdigest_percentile_of` too. The way tdigest works is that it groups the values into "centroids" represented by (mean, count), and the centroids in the "middle" are larger, i.e. they represent a larger number of values. And because the values likely arrive in random order, it's likely that 0 and non-0 values will be combined into a single centroid, which then has a non-zero mean. That might explain the issue, I think.
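A quick way to see that merging (a sketch with toy data, not from the issue): feed interleaved 0s and 1s into a digest with a low compression value and print it - some centroids are likely to end up combining both values, so their mean (sum / count) falls somewhere between 0 and 1:

```sql
-- Sketch: 500 zeroes and 500 ones with compression 10; some of the
-- printed centroids should mix both values, giving a fractional
-- sum / count for those centroids.
select tdigest(case when i % 2 = 0 then 0.0 else 1.0 end, 10)
from generate_series(1, 1000) i;
```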
In the digest you shared, the last "perfectly 0" centroid is (0.000000, 152) - all centroids after that have mean > 0. Summing the counts, the zero centroids cover 804 of the 1802 values, so only about 44.6% of the input data, which means the 0.5 percentile has to be approximated somehow. The next centroid is (172.050000, 208) - note the dump prints (sum, count), so its mean is 172.05 / 208 ≈ 0.827, pretty close to the pctl_50 result (the percentile is reached a bit before that).
So yeah, this is likely due to how tdigest works in general, and the accuracy is certainly worse for values close to 0.5.
One way to improve this is to increase the number of centroids, e.g. by using `tdigest(n, 1000)` instead of `tdigest(n, 100)`. That'll need more memory, of course, but that's the principal trade-off here.
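For example, a sketch of the same reproducer with the higher compression (untested; the only change is the compression parameter):

```sql
-- Same reproducer, but with compression 1000; the extra centroids
-- should noticeably improve accuracy around the median.
with data as (
    select generate_series(1, 10, 0.01) as n
    union all
    select generate_series(1, 10, 0.01) * 0 as n
), ttable as (
    select tdigest(n, 1000) as t
    from data
)
select
    tdigest_percentile(t, 0.5) as pctl_50,
    tdigest_percentile_of(t, 0) as pctl_of_0
from ttable;
```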
FWIW, another option would be to use a different function to determine the sizes of the centroids. By using a "sigmoid" function we're building smaller centroids at the tails, to provide better accuracy there. But we could use a "constant" function that keeps all centroids roughly the same size - that'd give us better accuracy in the middle, in exchange for worse accuracy at the tails. In fact, this is discussed in the t-digest paper, IIRC. But I'm not planning to do that; the tails are more important for most people.
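If I remember the paper correctly, the two scale functions look roughly like this, with $\delta$ being the compression parameter (treat the exact forms as an assumption and check the paper):

$$k_0(q) = \frac{\delta}{2}\,q \qquad\qquad k_1(q) = \frac{\delta}{2\pi}\,\arcsin(2q - 1)$$

$k_0$ is linear in the quantile $q$, so all centroids get roughly the same weight, while $k_1$ is steep near $q = 0$ and $q = 1$, which forces small centroids (and thus better accuracy) at the tails.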
@tvondra thank you very much for your answer. I'll experiment with more centroids, and as this is not a bug I'm closing the issue. And of course, thank you for implementing this extension in the first place - even though it doesn't cover this specific use case, it is still very useful :)
Hi, not sure if this is a bug, expected behavior, or whether I just need to read some books about statistics.

Since `tdigest_percentile(tdigest, 0.25)` returns 0, I was expecting that `tdigest_percentile_of(tdigest, 0)` would be >= 0.25, but it never is. The example above uses tdigest with 100 buckets, and each row is an aggregate over thousands of records.

Simplified query: