tdunning / t-digest

A new data structure for accurate on-line accumulation of rank-based statistics such as quantiles and trimmed means
Apache License 2.0
1.99k stars 228 forks source link

Add support for double weights #203

Open EnricoMi opened 2 years ago

EnricoMi commented 2 years ago

What are your thoughts on supporting double weights, instead of integer weights only? This would allow to use (0..1] weights, which would be more convenient than mapping those weights to integers in user code.

This would require to distinguish the semantics of count from weight, which could be beneficial in other use cases as well, e.g. #198.

Obviously, this will introduce a breaking change to the API.

tdunning commented 1 year ago

Yeah... there has been a fair bit of discussion on this.

The core question is what does the t-digest invariant actually mean with non-integer weights.

Do you have thoughts on that?

The key problems in the past include:

So, what do you think?

EnricoMi commented 1 year ago

Centroids now have to maintain their cardinality (the number of samples). Then, the exemption can be done based on the cardinality, not the weight (in fact, weight used to be some kind of cardinality, with the assumption of unit weight).

With non-integer weights, the t-digest invariant ${|C|}{k} = k (q_{right}) − k (q_{left}) ≤ 1$ requires ${q}_{left}$ and ${q}\{right}$ to be normalized by the sum of all weights, not the number of all samples (which used to be the same with unit weights):

$$q_{left} = W_{left}(C)/∑w$$

$$q_{right} = q_{left} + |C|/∑w$$

Then, the invariant should behave identical to equivalent integer weights.

For example, non-integer weighted samples (sample value, sample weight)

(1, 0.1), (2, 0.25), (2, 0.2)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 0.1)\}, \{2, 0.45)\}$

(0, 0.1/0.55), (0.1/0.55, 1)

are equivalent to these integer-weighted samples:

(1, 2), (2, 9)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 2)\}, \{(2, 9)\}$

(0, 2/11), (2/11, 1)

Difference is that a cluster $\{(1, 0.1)\}$ has cardinality 1, which is exempted from the invariant, while cluster $\{(1, 2)\}$ has cardinality 2, which is not exempted from the invariant.