tvondra / tdigest

PostgreSQL extension for estimating percentiles using t-digest
PostgreSQL License
87 stars 13 forks source link

API to allow incremental updates #7

Open tvondra opened 4 years ago

tvondra commented 4 years ago

Started as an e-mail discussion with @sporty81, who asked how to do incremental updates, i.e. being able to do something like this:

UPDATE t SET digest = tdigest_add(digest, new_value);

or something like that. A very early WIP version of this is implemented in #6 which essentially adds tdigest_add() and tdigest_union() that are conceptually equivalent to hll_add and hll_union. I'm sure the current code is rather inefficient but hopefully it's good enough for some basic feedback.

I'm more worried about accuracy, though. Essentially, every tdigest_add has to deserialize the value, adds the new value and then compacts the t-digest. The (de)serialization might be fairly expensive, but I'm sure there are ways to improve that, to some extent (one of the limiting factors will be that Postgres does not allow partial / in-place updates, so any change to the tdigest has to store a new copy of the whole value).

But I suspect compacting the digests after adding every single value may have negative impact on accuracy of the digest, so maybe we'll need to reconsider this strategy and invoke compaction less frequently.

sporty81 commented 4 years ago

What can I do to help you finish this up and release this update?

Thanks!

tvondra commented 3 years ago

Sorry, I was off for a couple days, hence the delay in responding.

I think this needs a bit of a research to

(a) quantify how expensive the de-serialization and serialization for every increment is an issue in practice. OTOH we can't do much about this anyway, because the digest is already a plain varlena buffer anyway, so most of the overhead will come from pglz and toasting.

(b) see if compacting after every value has negative impact on accuracy of the t-digest, or whether we need to batch it somehow (which I think we could without breaking the on-disk format by using a flag). I think the easiest way to evaluate this is to compare tdigests built incrementally (by using the WIP patch) and using the regular aggregates.

tvondra commented 3 years ago

I've significantly reworked/improved this feature, cleaned up the code and added docs and tests.

The API that I propose has three new functions:

I'm not sure whether to rename the third function to tdigest_add too - the naming difference seems somewhat arbitrary.

There's one more thing - by default, serializing the in-memory representation into t-digest forces compaction. But forcing it after adding each individual value does not seem like a great idea, so I've added a parameter that allows disabling it. Not sure if that's a good idea, though.

tvondra commented 3 years ago

I've polished the dev branch a bit and pushed it as 9d34031f01bd9f01d3fdff1a7b3567c94d1fb4d6.