Add support for double weights

EnricoMi commented 2 years ago

What are your thoughts on supporting double weights, instead of integer weights only? This would allow to use (0..1] weights, which would be more convenient than mapping those weights to integers in user code.

This would require to distinguish the semantics of count from weight, which could be beneficial in other use cases as well, e.g. #198.

Obviously, this will introduce a breaking change to the API.

tdunning commented 1 year ago

Yeah... there has been a fair bit of discussion on this.

The core question is what does the t-digest invariant actually mean with non-integer weights.

Do you have thoughts on that?

The key problems in the past include:

violation of invariant has been allowed for centroids with weight = 1. If weights < 1 are allowed, what is the status of this exemption? Is a weight less than one still an indivisible value?
if the exemption is removed so that all centroids must meet the invariant, we will require an infinite number of centroids for K=2 or K=3. How could that be resolved?
if we assume that any centroid that is added represents a single sample with variable weight rather than the number of samples represented by the weight, then we could allow the exemption from the scale invariant for all samples <= 1. What happens if we merge two such centroids and the weight is still < 1. Do we have to remember that this new centroid has more than one sample?

So, what do you think?

EnricoMi commented 1 year ago

Centroids now have to maintain their cardinality (the number of samples). Then, the exemption can be done based on the cardinality, not the weight (in fact, weight used to be some kind of cardinality, with the assumption of unit weight).

With non-integer weights, the t-digest invariant ${|C|}{k} = k (q_{right}) − k (q_{left}) ≤ 1$ requires ${q}_{left}$ and ${q}\{right}$ to be normalized by the sum of all weights, not the number of all samples (which used to be the same with unit weights):

$$q_{left} = W_{left}(C)/∑w$$

$$q_{right} = q_{left} + |C|/∑w$$

Then, the invariant should behave identical to equivalent integer weights.

For example, non-integer weighted samples (sample value, sample weight)

(1, 0.1), (2, 0.25), (2, 0.2)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 0.1)\}, \{2, 0.45)\}$

(0, 0.1/0.55), (0.1/0.55, 1)

are equivalent to these integer-weighted samples:

(1, 2), (2, 9)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 2)\}, \{(2, 9)\}$

(0, 2/11), (2/11, 1)

Difference is that a cluster $\{(1, 0.1)\}$ has cardinality 1, which is exempted from the invariant, while cluster $\{(1, 2)\}$ has cardinality 2, which is not exempted from the invariant.

tdunning / t-digest

Add support for double weights #203