tvondra / tdigest

PostgreSQL extension for estimating percentiles using t-digest
PostgreSQL License
88 stars 13 forks source link

Trimmed means? #14

Open StanleyP opened 3 years ago

StanleyP commented 3 years ago

Hello, in the heading of README you mention:

"This PostgreSQL extension implements t-digest, a data structure for on-line accumulation of rank-based statistics such as quantiles and trimmed means."

But I have the impression that you don't expose API for getting trimmed mean or at least list of centroids and their parameters programatically.

Am I missing something? Is it possible to calculate trimmed mean using your extension?

StanleyP commented 3 years ago

Just FYI as I am experimenting with TDIGEST I quickly hacked function to convert TDIGEST to DOUBLE PRECISION[], where [0,$count,$compression,$centroids(,$centroid,$centroid_count)+]. With this function I will be able to fetch TDIGESTs from PG and e.g. aggregate on client side etc. It would be nice to have this function provided directly by you extension. Cheers!

CREATE OR REPLACE FUNCTION tdigest_to_array(digest TDIGEST) RETURNS DOUBLE PRECISION[] AS $$
  SELECT
    regexp_split_to_array(
      left(regexp_replace(
        digest::TEXT,
        '^flags 0 count ([0-9]*) compression ([0-9]*) centroids ([0-9]*)','0,\1) (\2, \3)'
      ),-1),
      '(\) \()|,')::DOUBLE PRECISION[]
$$ LANGUAGE sql IMMUTABLE;
tvondra commented 3 years ago

Ah, right - the API does not allow calculating of trimmed means. The t-digest allows that in principle, but it's not implemented. Adding it should not be very difficult though, I just didn't need it and no one requested it so far.

As for exposing the t-digest as double precision, I'm not against doing that in principle, but it seems a bit fragile - people need to know which field is what, etc. I wonder if exposing it as a more structured type (e.g. json) would be better. Or maybe something like

CREATE OR REPLACE FUNCTION tdigest_export(digest tdigest, out count int, out compression int, out ncentroids int, out means double precision, out counts int) AS ...

BTW what's your use case for aggregating the data outside the database?

StanleyP commented 3 years ago

Thanks for the reply! The API for trimmed means would be really great and maybe more people would appreciate it:)

As for the serialization of TDIGEST to DOUBLE PRECISION[] you are right of course, it is fragile, thats why I called it a hack, but it got the job done for me. Maybe define a CAST to JSONB or something would be sufficient.

It would be great if there was standard and compact text/string serialization format for TDIGEST so you can generate TDIGESTs on DB side and consume them in any other programming language/framework.

My use case is fetching pre-aggregated multi-dimensional statistics to front-end JS code and do the additional filtering and grouping in crossfilter2 https://github.com/crossfilter/crossfilter on the client-side.

tvondra commented 3 years ago

Interesting, does crossfilter support tdigest out of the box, or does it require some custom code? I'm not very familiar with this JS stuff, so maybe it's naive question.

Anyway, for JS I think JSON format would be a good match - the question is whether to tweak the the text format or simply add a custom tdigest -> json cast.

tvondra commented 3 years ago

I've pushed two WIP changes to this branch: https://github.com/tvondra/tdigest/tree/json-output

The first one adds a cast to JSON, so that it's possible to do something like this:

select tdigest(random(), 15)::json from generate_series(1,10000)`

or more SQL-standard-compliant

select cast(tdigest(random(), 15) as json) from generate_series(1,10000)`

The format is fairly simple, I think. The one detail I decided not to include is the flags field, which is currently unused, and if used it'll be for internal purposes.

The second commit adds tdigest_mean for calculating the trimmed mean, like this:

select tdigest_mean(r, 15, 0.1, 0.9), avg(r) from data

Both patches need more testing, but some initial feedback would be handy.

StanleyP commented 3 years ago

Ad cast and trimmed mean: great news, thank you very much! Will try to test it ASAP and let you know...

Ad crossfilter: no, crossfilter does not support tdigest out of the box, but you can define custom reducers for any data structure with just couple lines of code.

StanleyP commented 3 years ago

I tried the json-output branch against official docker image postgres:12.4 and I can confirm that it builds, installs and works fine.

For my use case it would be very beneficial if you could compact the JSON representation to a minimum: minimal JSON property names and especially trimming the trailing zeros from numbers in sum array. I am transferring (tens or possibly hundreds of) thousands data points from PG backend to client so every byte counts to the required bandwidth and transfer time.

Further it would be also very beneficial if you could extend the trimmed mean API for TDIGEST type itself so I can calculate trimmed mean of precalculated TDIGEST(s).

Everyones use-case is different, so take it as just my 2 cents to the discussion.

Thanks for your work on this excellent PG extension!

tvondra commented 3 years ago

Yeah, I'll definitely add the trimmed mean for pre-calculated tdigests, I've been too lazy to do that in the experimental WIP patch.

I'll see what I can do to minimize the json - I'm not a huge fan of minimizing the property names, because there are not that many and it'd make it harder to work with. So that doesn't seem like a good trade-off. About the trailing zeroes - I suppose you mean trailing zeroes after the decimal point, right? Like 0.5130000 should become 0.513 etc. I'll see what I can do about it - I'm simply relying on API that does the formatting (using sprintf IIRC) but I'll try to tweak it somehow.

tvondra commented 3 years ago

I've pushed (into the branch) a bit more complete version, adding the function to calculate mean from t-digest and tweaking the formatting. Can you check if this works fine for your use case?

StanleyP commented 3 years ago

Yes, new version also builds, installs and works fine on official docker postgres:12.4. I can confirm that trimmed means can now be applied to pre-calculated TDIGESTs and CAST to JSON seems to be compacted enough to be usable in my use-case. Thanks again for such quick response and all the coding. Hope this WIP will eventually make the master branch for others to use this great features. Cheers!

tvondra commented 3 years ago

I have pushed most of the improvements discussed in this thread into the master branch, after some improvements and polishing. That includes:

1) cast to JSON value 2) cast to DOUBLE PRECISION[] value 3) trimmed aggregates (tdigest_sum/tdigest_avg) 4) tdigest_digest_sum/tdigest_digest_avg calculating sum/avg for individual t-digest values (no aggregation needed)

Can you take a look at the improvements, and see if there's something that doesn't work the way you'd like?

All this is in 1.3.0-dev version. It's not the default version, so it needs to be specified in CREATE EXTENSION explicitly.

I've also modified the tests quite a bit, to improve the coverage and reduce duration.