Open EnricoMi opened 2 years ago
Yeah... there has been a fair bit of discussion on this.
The core question is what does the t-digest invariant actually mean with non-integer weights.
Do you have thoughts on that?
The key problems in the past include:
So, what do you think?
Centroids now have to maintain their cardinality (the number of samples). Then, the exemption can be done based on the cardinality, not the weight (in fact, weight used to be some kind of cardinality, with the assumption of unit weight).
With non-integer weights, the t-digest invariant ${|C|}{k} = k (q_{right}) − k (q_{left}) ≤ 1$ requires ${q}_{left}$ and ${q}\{right}$ to be normalized by the sum of all weights, not the number of all samples (which used to be the same with unit weights):
$$q_{left} = W_{left}(C)/∑w$$
$$q_{right} = q_{left} + |C|/∑w$$
Then, the invariant should behave identical to equivalent integer weights.
For example, non-integer weighted samples (sample value, sample weight)
(1, 0.1), (2, 0.25), (2, 0.2)
with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 0.1)\}, \{2, 0.45)\}$
(0, 0.1/0.55), (0.1/0.55, 1)
are equivalent to these integer-weighted samples:
(1, 2), (2, 9)
with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 2)\}, \{(2, 9)\}$
(0, 2/11), (2/11, 1)
Difference is that a cluster $\{(1, 0.1)\}$ has cardinality 1, which is exempted from the invariant, while cluster $\{(1, 2)\}$ has cardinality 2, which is not exempted from the invariant.
What are your thoughts on supporting double weights, instead of integer weights only? This would allow to use
(0..1]
weights, which would be more convenient than mapping those weights to integers in user code.This would require to distinguish the semantics of
count
fromweight
, which could be beneficial in other use cases as well, e.g. #198.Obviously, this will introduce a breaking change to the API.