Open tvondra opened 4 years ago
What can I do to help you finish this up and release this update?
Thanks!
Sorry, I was off for a couple days, hence the delay in responding.
I think this needs a bit of a research to
(a) quantify how expensive the de-serialization and serialization for every increment is an issue in practice. OTOH we can't do much about this anyway, because the digest is already a plain varlena buffer anyway, so most of the overhead will come from pglz and toasting.
(b) see if compacting after every value has negative impact on accuracy of the t-digest, or whether we need to batch it somehow (which I think we could without breaking the on-disk format by using a flag). I think the easiest way to evaluate this is to compare tdigests built incrementally (by using the WIP patch) and using the regular aggregates.
I've significantly reworked/improved this feature, cleaned up the code and added docs and tests.
The API that I propose has three new functions:
tdigest_add(tdigest, double precision)
- adds a single value to the t-digesttdigest_add(tdigest, double precision[])
- adds an array of values to the t-digesttdigest_union(tdigest, tdigest)
- merges-in a t-digestI'm not sure whether to rename the third function to tdigest_add
too - the naming difference seems somewhat arbitrary.
There's one more thing - by default, serializing the in-memory representation into t-digest forces compaction. But forcing it after adding each individual value does not seem like a great idea, so I've added a parameter that allows disabling it. Not sure if that's a good idea, though.
I've polished the dev branch a bit and pushed it as 9d34031f01bd9f01d3fdff1a7b3567c94d1fb4d6.
Started as an e-mail discussion with @sporty81, who asked how to do incremental updates, i.e. being able to do something like this:
or something like that. A very early WIP version of this is implemented in #6 which essentially adds
tdigest_add()
andtdigest_union()
that are conceptually equivalent tohll_add
andhll_union
. I'm sure the current code is rather inefficient but hopefully it's good enough for some basic feedback.I'm more worried about accuracy, though. Essentially, every
tdigest_add
has to deserialize the value, adds the new value and then compacts the t-digest. The (de)serialization might be fairly expensive, but I'm sure there are ways to improve that, to some extent (one of the limiting factors will be that Postgres does not allow partial / in-place updates, so any change to the tdigest has to store a new copy of the whole value).But I suspect compacting the digests after adding every single value may have negative impact on accuracy of the digest, so maybe we'll need to reconsider this strategy and invoke compaction less frequently.