tvondra / tdigest

PostgreSQL extension for estimating percentiles using t-digest
PostgreSQL License

Unexpected results from tdigest_percentile_of(tdigest, hypothetical_value) #15

Closed dpasqualin closed 3 years ago

dpasqualin commented 3 years ago

Hi, I'm not sure if this is a bug, expected behavior, or whether I should just read some books about statistics.

Since tdigest_percentile(tdigest, 0.25) returns 0, I was expecting tdigest_percentile_of(tdigest, 0) to be >= 0.25, but it never is:

time                        pctl_50  pctl_25  pctl_10  pctl_05  pctl_of_0  pctl_of_1
2021-01-12 04:00:00.000000  0.074    0        0        0        0.189      0.678
2021-01-12 05:00:00.000000  0.07     0        0        0        0.141      0.675
2021-01-11 17:00:00.000000  0.025    0        0        0        0.152      0.68

The example above uses tdigest with 100 buckets and each row is an aggregate over thousands of records.

Simplified query:

WITH a AS (
    SELECT
        time,
        TDIGEST(data, 100)  AS t_100,
        ...
    GROUP BY
        time, b, c, d
) SELECT 
    time,
    tdigest_percentile(t_100, 0.75) as pctl_75,
    ...
    tdigest_percentile_of(t_100, 0) as pctl_of_0,
    tdigest_percentile_of(t_100, 1) as pctl_of_1
FROM
    a
GROUP BY
    time
tvondra commented 3 years ago

Interesting. I wonder if this depends on what exactly the t-digests look like, but it's hard to say. Can you provide a reproducer (i.e. a data set exhibiting this issue), or at least the t-digest output?

dpasqualin commented 3 years ago

I think this query replicates the problem. There are equal numbers of zeroes and values bigger than zero, so I was expecting tdigest_percentile_of(t, 0) to be close to 0.5 (as it is if I use percentile_cont(0.50) within group (order by n)). I understand that tdigest is more precise close to 0 or 1, but does that apply to the result of tdigest_percentile_of too?

Another thing I found strange is that if I don't order by n in the first query, the results are different (still inaccurate, though). Should I order the data prior to using tdigest?

with data as (
    select generate_series(1, 10, 0.01) as n
    union all
    select generate_series(1, 10, 0.01) * 0 as n
    order by n
), ttable as (
    select
        tdigest(n, 100) as t
    from
        data
)
select
    tdigest(t) as tdigest,
    tdigest_percentile(t, 0.5) as pctl_50,
    tdigest_percentile(t, 0.25) as pctl_25,
    tdigest_percentile(t, 0.10) as pctl_10,
    tdigest_percentile(t, 0.05) as pctl_05,
    tdigest_percentile_of(t, 0) as pctl_of_0,
    tdigest_percentile_of(t, 1) as pctl_of_1
from
    ttable

tdigest       pctl_50  pctl_25  pctl_10  pctl_05  pctl_of_0  pctl_of_1
(see below)   0.794    0        0        0        0.223      0.512

The tdigest:

flags 0 count 1802 compression 100 centroids 44 (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 1) (0.000000, 2) (0.000000, 3) (0.000000, 4) (0.000000, 5) (0.000000, 8) (0.000000, 11) (0.000000, 15) (0.000000, 21) (0.000000, 29) (0.000000, 40) (0.000000, 53) (0.000000, 68) (0.000000, 84) (0.000000, 98) (0.000000, 98) (0.000000, 107) (0.000000, 152) (172.050000, 208) (572.460000, 188) (748.890000, 157) (765.700000, 124) (675.180000, 93) (548.420000, 68) (423.850000, 49) (308.210000, 34) (224.520000, 24) (152.880000, 16) (106.590000, 11) (78.280000, 8) (49.250000, 5) (39.580000, 4) (19.850000, 2) (19.890000, 2) (9.960000, 1) (9.970000, 1) (9.980000, 1) (9.990000, 1) (10.000000, 1)
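
For reference, the exact values for the same generated data can be computed with ordinary SQL. This is just a cross-check sketch, not part of the reproducer above, and the column aliases are made up:

-- Cross-check sketch: exact counterparts of the estimates above,
-- computed on the same generated data.
with data as (
    select generate_series(1, 10, 0.01) as n
    union all
    select generate_series(1, 10, 0.01) * 0 as n
)
select
    percentile_cont(0.50) within group (order by n) as exact_pctl_50,
    percentile_cont(0.25) within group (order by n) as exact_pctl_25,
    -- fraction of rows with n <= 0, i.e. the exact analogue of
    -- tdigest_percentile_of(t, 0)
    avg((n <= 0)::int) as exact_pctl_of_0
from
    data

With 901 zeroes and 901 non-zero values, the exact fraction of zeroes is 0.5, and the exact (interpolated) median falls halfway between the last 0 and the first non-zero value (1), i.e. 0.5.
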
tvondra commented 3 years ago

Yes - the accuracy trade-off applies to tdigest_percentile_of too. The way t-digest works is that it groups the values into "centroids" represented by (mean, count), and the centroids in the "middle" are larger, i.e. they represent a larger number of values. And because the values likely arrive in random order, it's likely that the 0 and non-0 values get combined into the same centroid, which then has a non-zero mean. That might explain the issue, I think.

In the t-digest you shared, the last "perfectly 0" centroid is (0.000000, 152) - all centroids after that have a mean > 0. The zero centroids cover ~800 values, i.e. only about 44% of the input data, so the 0.5 percentile has to be approximated somehow. The next centroid is (172.050000, 208), which has a mean of ~0.827, so pretty close to the pctl_50 value (the percentile is reached a bit before that).
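Spelling out the arithmetic behind those numbers (derived from the dump above, which appears to print (sum, count) per centroid, since 172.05 / 208 matches the quoted mean):

-- Re-deriving the figures quoted above from the t-digest dump:
-- the 23 all-zero centroids add up to 804 of the 1802 values.
select
    804.0 / 1802 as fraction_in_zero_centroids,  -- ~0.446, the "about 44%"
    172.05 / 208 as mean_of_next_centroid        -- ~0.827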

So yeah, this is likely due to how t-digest works in general, and the accuracy is certainly worse for percentiles close to 0.5.

One way to improve this is probably to increase the number of centroids, e.g. by using tdigest(n, 1000) instead of tdigest(n, 100). That will need more memory, of course, but that's the principal trade-off here.
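For example, reusing the reproducer from above, the two compression settings can be compared side by side (a sketch; the column aliases are made up):

-- Sketch: same generated data, t-digests built with 100 vs 1000 centroids.
with data as (
    select generate_series(1, 10, 0.01) as n
    union all
    select generate_series(1, 10, 0.01) * 0 as n
), ttable as (
    select
        tdigest(n, 100)  as t_100,
        tdigest(n, 1000) as t_1000
    from
        data
)
select
    tdigest_percentile(t_100, 0.5)   as pctl_50_100,
    tdigest_percentile(t_1000, 0.5)  as pctl_50_1000,
    tdigest_percentile_of(t_100, 0)  as pctl_of_0_100,
    tdigest_percentile_of(t_1000, 0) as pctl_of_0_1000
from
    ttable

The digest size grows roughly with the number of centroids, which is the memory trade-off mentioned above.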

tvondra commented 3 years ago

FWIW, another option would be to use a different function to determine the sizes of the centroids. By using a "sigmoid"-like function we build smaller centroids on the tails, to provide better accuracy there. But we could use a "constant" function that keeps all centroids roughly the same size - that would give us better accuracy in the middle, in exchange for worse accuracy on the tails. In fact, this is discussed in the t-digest paper, IIRC. But I'm not planning to do that, as the tails are more important for most people.
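For context (paraphrasing the t-digest paper from memory, so treat the exact formulas as approximate): the paper expresses this through a scale function k(q), and a centroid covering the quantile range [q_left, q_right] is only allowed to grow while k(q_right) - k(q_left) <= 1. Two of the scale functions look roughly like:

k_0(q) = \frac{\delta \, q}{2}
k_1(q) = \frac{\delta}{2\pi} \arcsin(2q - 1)

k_0 keeps all centroids about the same size, while k_1 is steep near q = 0 and q = 1, so centroids at the tails stay tiny and the ones around the median grow large - exactly the pattern visible in the dump above.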

dpasqualin commented 3 years ago

@tvondra thank you very much for your answer. I'll experiment with more centroids, and since this is not a bug I'm closing the issue. And of course thank you for implementing this extension in the first place - even though it doesn't cover this specific use case, it is still very useful :)