tvondra / tdigest

PostgreSQL extension for estimating percentiles using t-digest
PostgreSQL License

Add <value, count> API #3

Closed min-mwei closed 3 years ago

min-mwei commented 4 years ago

The use case is that the input data, instead of an array of doubles, could be <value, count> pairs, say <0.1, 100>, <0.2, 50>, meaning there are 100 occurrences of value 0.1 and 50 occurrences of 0.2. Effectively these are centroids from the calculation's perspective.
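For illustration, a minimal sketch of the intended usage (the exact signature is open for discussion; the table name samples and the compression value 100 here are just placeholders):

    -- hypothetical usage of the proposed <value, count> aggregate, assuming
    -- a table samples(value double precision, count bigint)
    SELECT tdigest_percentile(d, 0.95)
      FROM (SELECT tdigest(value, count, 100) AS d FROM samples) s;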

In conversation with @thanodnl, who is adding support for tdigest in Citus, I created this separate pull request to add the API to the tdigest core first.

I am trying to follow the spirit of the current API design; please feel free to make alternative suggestions. Also, implementation-wise, I think it's better to duplicate the C code a bit to make the intent clearer.

Please let me know what you think. Support for this API is essential for my use case (the counts could be very big, making it impossible to use the existing SQL API). Thanks, --min

tvondra commented 3 years ago

I started looking at this again - aside from the refactoring / code reuse discussed earlier (which I can take care of), I have one concern regarding how this is implemented. Currently, it simply calls tdigest_add_centroid, adding one centroid with the whole count. But that means we may end up with one "huge" centroid in the tail part, where we should have much smaller centroids.

In other words, I suppose this

SELECT tdigest(1.0, 100, 10)

should produce the same t-digest as

SELECT tdigest(1.0, 10) FROM generate_series(1,100)

So I wonder if we should "explode" the count into individual values, doing something like this

    /* add the value once per occurrence, i.e. as count centroids of size 1 */
    for (i = 0; i < count; i++)
        tdigest_add_centroid(state, PG_GETARG_FLOAT8(1), 1);

instead of just adding one huge centroid. This is going to be more expensive, as it may require compacting the t-digest (possibly repeatedly), but it's still way cheaper than adding the values one by one at the SQL level, particularly when working with pre-calculated t-digests, where each SQL-level call requires (de)serialization.

tvondra commented 3 years ago

I've spent a bit of time working on this - adding functions with count for other parts of the API, improving tests, etc. I can't push to the branch, so I've pushed the reworked version into https://github.com/tvondra/tdigest/tree/value-count-api. Ultimately, this adds 4 aggregates, listed in the "Pre-aggregated data" section of the README.
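Roughly, the usage looks like this (see that README section for the authoritative list of signatures; the queries below are my paraphrase, with a table t(value, count) and compression 100 as placeholders):

    -- build a t-digest from pre-aggregated <value, count> pairs
    SELECT tdigest(value, count, 100) FROM t;

    -- or estimate a percentile directly from the pre-aggregated data
    SELECT tdigest_percentile(value, count, 100, 0.95) FROM t;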

So far I haven't managed to refactor the functions to improve the code reuse - I'll look at that next, but TBH I'm not too worried about code duplication.

I'm a bit concerned about one thing, though - I wonder if pre-aggregating the data like this may have an impact on the accuracy of the resulting t-digest. Aside from the "large centroid" issue mentioned earlier (and solved by adding the values one by one), the pre-aggregated data may also end up being sorted. I suspect that may negatively affect the t-digest by skewing it in some way, but I'm not sure. It needs some experiments / testing.

tvondra commented 3 years ago

BTW I've reverted the Makefile changes - I'm not against improving that part, but let's do that separately, not as part of this. I do however plan to stick to only having upgrade SQL scripts, i.e. no standalone SQL scripts for intermediate versions. That's what we do for extensions in contrib, so let's stick to that.

tvondra commented 3 years ago

So, I did a bunch of tests to see if pre-aggregating the data and using the value+count API has a negative effect on the accuracy of the results. Attached is a simple SQL script I used for that - it creates a table, populates it with data, and then there are a couple of views that measure the accuracy as the distance of the percentiles from percentile_cont results. All the values are from the [0.0, 1.0] interval.

We know the t-digest is more accurate on the tails by design, and it's probably used mostly for looking at such percentiles. So I've calculated the "distance" for tails of varying length: 0.1 means the result only considers percentiles in [0.0, 0.1] and [0.9, 1.0].

I've also been wondering about the impact of the ordering of the input data, so the results cover input data randomized (ORDER BY random()), ascending and descending (ORDER BY v ASC/DESC). The other factor is the data distribution - it may be either uniform, or skewed in some way. The SQL script allows setting the pow() parameter - low values skew towards 1.0, high values towards 0.0.
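This is not the attached script, just a minimal sketch of the idea behind the "distance" measure, assuming a table t(v double precision):

    -- how far the t-digest estimate of a percentile lands from the exact
    -- value computed by percentile_cont (here for the 0.99 percentile)
    SELECT abs(
             (SELECT tdigest_percentile(v, 100, 0.99) FROM t)
           - (SELECT percentile_cont(0.99) WITHIN GROUP (ORDER BY v) FROM t)
           ) AS distance;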

Typical results for uniform distribution look like this:

  p   | simple_random | simple_asc | simple_desc | preagg_random | preagg_asc | preagg_desc 
------+---------------+------------+-------------+---------------+------------+-------------
 0.01 |        0.0012 |    0.00062 |     0.00064 |       0.00153 |    0.00062 |     0.00064
 0.05 |        0.0102 |    0.00461 |     0.00419 |       0.01332 |    0.00461 |     0.00419
 0.10 |       0.02292 |    0.00904 |     0.00865 |       0.03104 |    0.00904 |     0.00865
 0.20 |       0.04189 |    0.02154 |     0.02061 |       0.05488 |    0.02154 |     0.02061
 0.30 |        0.0692 |    0.02956 |     0.02993 |        0.0756 |    0.02956 |     0.02993
 0.40 |       0.10915 |    0.03291 |     0.03804 |       0.09426 |    0.03291 |     0.03804
 0.50 |       0.12462 |    0.04414 |     0.04354 |       0.12213 |    0.04414 |     0.04354

For pow(x, 0.5) it looks like this:

  p   | simple_random | simple_asc | simple_desc | preagg_random | preagg_asc | preagg_desc 
------+---------------+------------+-------------+---------------+------------+-------------
 0.01 |       0.00396 |    0.00447 |     0.00438 |       0.00554 |    0.00447 |     0.00438
 0.05 |       0.01348 |     0.0082 |     0.00999 |       0.02497 |     0.0082 |     0.00999
 0.10 |       0.01772 |    0.01702 |     0.02348 |       0.02101 |    0.01702 |     0.02348
 0.20 |       0.03138 |    0.03486 |     0.03581 |       0.04954 |    0.03486 |     0.03581
 0.30 |       0.04346 |    0.04498 |     0.04692 |       0.04808 |    0.04498 |     0.04692
 0.40 |       0.05893 |    0.04979 |     0.05195 |       0.06639 |    0.04979 |     0.05195
 0.50 |       0.08098 |    0.06522 |     0.06013 |       0.06685 |    0.06522 |     0.06013

and for pow(x, 2.0) it looks like this:

  p   | simple_random | simple_asc | simple_desc | preagg_random | preagg_asc | preagg_desc 
------+---------------+------------+-------------+---------------+------------+-------------
 0.01 |       0.00175 |    0.00116 |     0.00096 |       0.00296 |    0.00116 |     0.00096
 0.05 |         0.016 |    0.00629 |     0.00641 |       0.01625 |    0.00629 |     0.00641
 0.10 |        0.0284 |    0.01245 |     0.01241 |       0.03844 |    0.01245 |     0.01241
 0.20 |       0.04509 |    0.02346 |     0.02251 |       0.06301 |    0.02346 |     0.02251
 0.30 |        0.1042 |    0.03899 |      0.0414 |        0.0975 |    0.03899 |      0.0414
 0.40 |       0.11788 |     0.0898 |     0.07147 |       0.14646 |     0.0898 |     0.07147
 0.50 |       0.14631 |    0.14352 |     0.08708 |       0.15041 |    0.14352 |     0.08708

The main observation is that there's no difference between the simple and pre-aggregated results. If you compare e.g. simple_random and preagg_random, the overall behavior is almost exactly the same. Same for the other columns. This suggests the value+count API does not make the t-digests less accurate, which was my concern.

A secondary observation is that randomized ordering is generally outperformed by both ASC and DESC. I find that rather surprising - I'd have expected the exact opposite to be true. I wonder why that is, but it applies to the simple API too, so it's not really related to this pull request.

I'll do a bunch more tests with different parameters (number of distinct values, number of occurrences), but it looks fine.

tdigest-test.txt

min-mwei commented 3 years ago

Hi Tomas, I just "saw" the work you have done on this - for some reason the email notification did not fire, and I have been busy with other things. I will read your tests and changes. And yes, doing this on a separate branch makes sense.

min-mwei commented 3 years ago

I just created a build based on Tomas's branch and its API to see how it works. The environment is PG 13.1 with Citus 9.5. I will share some results here if possible, otherwise I'll follow up over email. Thanks.

min-mwei commented 3 years ago

I did some initial tests. I just realized your new version adds one value at a time in the C code. That won't work for my use case. I enclosed a real-world example below; as you can see, the counts can go into the billions. Also, with the numerical changes in 1.0.x, the metric result for the enclosed data is incorrect (e.g. the P95 should be 26; my old version returns 25.2, but now it's ~0).

    count    |      value
-------------+------------
 47325940488 |          1
 15457695432 |          2
  6889790700 |          3
  4188763788 |          4
  2882932224 |          5
  2114815860 |          6
  1615194324 |          7
  2342114568 |          9
  1626471924 |         11
  1660755408 |         14
  1143728292 |         17
  1082582424 |         21
   911488284 |         26
   728863908 |         32
   654898692 |         40
   530198076 |         50
   417883440 |         62
   341452344 |         77
   274579584 |         95
   231921120 |        118
   184091820 |        146
   152469828 |        181
   125634972 |        224
   107059704 |        278
    88746120 |        345
    73135668 |        428
    61035756 |        531
    50683320 |        658
    42331824 |        816
    35234400 |       1012
    29341356 |       1255
    24290928 |       1556
    20284668 |       1929
    17215908 |       2391
    14737488 |       2964
    12692772 |       3674
    11220732 |       4555
     9787584 |       5647
     8148420 |       7000
     6918612 |       8678
     6015000 |      10758
     5480316 |      13336
     5443356 |      16532
     4535616 |      20494
     3962316 |      25406
     3914484 |      31495
     3828108 |      39043
     3583536 |      48400
     4104120 |      60000
   166024740 | 2147483647

thanodnl commented 3 years ago

@tvondra Thanks for running these comprehensive tests. Impressive work!

The main observation is that there's no difference between the simple and pre-aggregated results. If you compare e.g. simple_random and preagg_random, the overall behavior is almost exactly the same. Same for the other columns. This suggests the value+count API does not make the t-digests less accurate, which was my concern.

@tvondra Is it correct for me to assume that your conclusion here is that adding a centroid at a time does not compromise the accuracy? If that is the case, what is holding us back from going that route and implementing this throughout the API (if not already done on your branch)? If it is a matter of time, I might be able to help go through all API endpoints and crank it out. However, if we can't add a centroid at a time due to accuracy, and we can't add a value at a time due to the distribution shared above, I might have some ideas for how we can meet in the middle.

From a Citus perspective, we will need to add a bit of code to our optimiser to make sure the new API can be pushed down to the workers in distributed queries; however, that is trivial once the API signature is finalised. I am in contact with Min to understand whether he is running into limitations of Citus and whether a development version can help him.

tvondra commented 3 years ago

@min-mwei Thanks for the feedback. Yeah, if the counts are in the billions, then adding the values one by one is not going to fly. Doing that loop inside the aggregate is likely faster than actually passing the values one by one, but with billions that's still a lot of time.

My concern is that if we simply add one huge centroid, it'll be fast but it will mess up the t-digest in some way, making it less accurate or something. In principle, the size of a centroid depends on where in the distribution it is - on the tails we keep very tiny centroids, while close to the median the centroids are much larger. This is what gives t-digests good accuracy for percentiles close to 1.0, etc. Imagine you have a t-digest representing 10000 values, and now you inject a single centroid (value, 10000) where the value is larger than any value in the original t-digest. How will that affect the accuracy of estimates close to 1.0? I haven't tried it, but I'd bet it'll degrade.

I understand handling this as adding the values one by one is not suitable for such high counts, but OTOH adding a single huge centroid does not seem like a good solution. I may be wrong, though. I guess one way to move this forward is doing some testing, showing the performance and accuracy impact.

tvondra commented 3 years ago

@tvondra Is it correct for me to assume that your conclusion here is that adding a centroid at a time does not compromise the accuracy? If that is the case, what is holding us back from going that route and implementing this throughout the API (if not already done on your branch)? If it is a matter of time, I might be able to help go through all API endpoints and crank it out. However, if we can't add a centroid at a time due to accuracy, and we can't add a value at a time due to the distribution shared above, I might have some ideas for how we can meet in the middle.

@thanodnl Well, not quite. As @min-mwei pointed out, this implementation is internally adding the values one by one, not adding a single centroid. So the conclusions are more about "grouping" the values and the order in which we add them, not about adding them as a single centroid. Plus the counts were rather low (10-60), certainly not in the billions as @min-mwei mentions.

My guess is the right approach will be something in between adding the values one by one and adding one huge centroid. As I mentioned in a comment, I think we can look at how large the existing centroids around the value are, and add centroids of that size. That should be better than the current (naive) approach, without breaking the accuracy.

IMO the thing holding this back is that I'm not sure how much this affects the "quality" of the t-digest - it seems pointless to speed something up 100x if the result gets too inaccurate to be useful. I'm worried we'll add this API, people will start using it, and then we'll discover the accuracy is terrible, but we'll be stuck with it. AFAIK t-digests don't have analytical accuracy guarantees, so the only thing we can do is a bunch of tests with such data sets.

min-mwei commented 3 years ago

Thanks for the discussions. I think there are two issues: 1) adding a proper API for centroids, and 2) doing algorithmic work to make the results more accurate.

For 1), I think it needs to happen, at least for my use case. Besides the fact that the "raw data" could come in as histograms, we also need to interoperate with other systems which might use t-digest as well! The <value, count> pair is a more "portable" data interchange format.

For 2), this is something we could improve later on. Currently I hacked Tomas's API change to "force centroids" instead of looping through the values one by one. For my real-world data, the results I have seen are usable.

so the only thing we can do is a bunch of tests with such data sets.

Yes. So I think the API should be introduced, with warnings added in the docs. I could share some <value, count> data in the near future to help get the algorithm improved.

min-mwei commented 3 years ago

Just reforked/synced, to have a cleaner change, which I am using for production: https://github.com/tvondra/tdigest/pull/16

tvondra commented 3 years ago

I finally had some time to look at this again over the weekend - I've been skeptical about adding a single large centroid, and the more I think about it the more I think it'll have negative effects on accuracy of the estimates. Maybe not always, but certainly if the large centroid ends up on a tail of the t-digest (close to 0.0 or 1.0).

I mentioned that maybe we should look at the existing t-digest, see what the centroid size is close to where the new value belongs (essentially, look up the closest centroids before/after the new value), and add centroids of that size. But after thinking about it, that's fairly unpredictable. The value may actually be on the tail, so the centroids would be tiny and we wouldn't optimize anything.

But I had a different idea - we can generate an artificial t-digest representing the (value, count) pair. It's trivial to build such a t-digest in one pass, by solving the q0/q2 inequalities and adding centroids, and then we can "merge" this t-digest into the existing one. That is a bit slower than adding one large centroid, but much, much faster than adding the values one by one. And the resulting t-digest is actually correct (well, unless there are some bugs).
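To illustrate the idea (this is only a sketch, not the actual C code, and it assumes the usual asin-based scale function k(q) = delta / (2 * pi) * asin(2q - 1)): the centroid boundaries are simply the quantiles where k(q) crosses integer values, so the sizes can be generated in one pass and come out tiny near the tails and large near the median.

    -- centroid sizes for a single (value, count) pair, with count = 1000000
    -- and compression delta = 100 picked purely as an example; boundaries
    -- sit at q = (sin(2 * pi * k / delta) + 1) / 2 for integer k
    WITH params AS (SELECT 100.0 AS delta, 1000000::bigint AS total)
    SELECT round((q2 - q1) * total) AS centroid_size
      FROM params,
           LATERAL (
             SELECT (sin(2 * pi() * k / delta) + 1) / 2       AS q1,
                    (sin(2 * pi() * (k + 1) / delta) + 1) / 2 AS q2
               FROM generate_series(-delta / 4, delta / 4 - 1, 1) AS k
           ) b
     WHERE round((q2 - q1) * total) >= 1;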

I've pushed this into the https://github.com/tvondra/tdigest/commits/value-count-api branch - it seems to be working fine, so maybe take a look.

While working on this, I ran into a couple pre-existing issues, though.

Firstly, tdigest_deserialize was not marked as STRICT, which caused crashes in some rare cases. I fixed this, but this will require a bit more work (new minor version, upgrade script).

Secondly, I realized that sorting by (mean, count) does not really work well for the artificial t-digest, because all the centroids have the same mean, so we end up sorting by count only. The consequence, however, is that the centroids at the end have the highest counts, i.e. it breaks the idea that centroids on the tails are small, which may mean lower accuracy for percentiles close to 1.0.

Interestingly enough, Ted Dunning noticed this issue some time ago and solved it by rebalancing the centroids after the sort:

https://github.com/tdunning/t-digest/commit/eca1125a39c13e918d7054105dc89bcda11cfa44

The branch now does something similar, and it seems to be working fine.

min-mwei commented 3 years ago

Great, I just tried out the test case I enclosed above. It does have a better bound than a simple single centroid, and the performance is comparable.

with data(c, v) as (values
(47325940488 ,          1),
(15457695432 ,          2),
(6889790700 ,          3),
(4188763788 ,          4),
(2882932224 ,          5),
(2114815860 ,          6),
(1615194324 ,          7),
(2342114568 ,          9),
(1626471924 ,         11),
(1660755408 ,         14),
(1143728292 ,         17),
(1082582424 ,         21),
(911488284 ,         26),
(728863908 ,         32),
(654898692 ,         40),
(530198076 ,         50),
(417883440 ,         62),
(341452344 ,         77),
(274579584 ,         95),
(231921120 ,        118),
(184091820 ,        146),
(152469828 ,        181),
(125634972 ,        224),
(107059704 ,        278),
( 88746120 ,        345),
( 73135668 ,        428),
( 61035756 ,        531),
( 50683320 ,        658),
( 42331824 ,        816),
( 35234400 ,       1012),
( 29341356 ,       1255),
( 24290928 ,       1556),
( 20284668 ,       1929),
( 17215908 ,       2391),
( 14737488 ,       2964),
( 12692772 ,       3674),
( 11220732 ,       4555),
(  9787584 ,       5647),
(  8148420 ,       7000),
(  6918612 ,       8678),
(  6015000 ,      10758),
(  5480316 ,      13336),
(  5443356 ,      16532),
(  4535616 ,      20494),
(  3962316 ,      25406),
(  3914484 ,      31495),
(  3828108 ,      39043),
(  3583536 ,      48400),
(  4104120 ,      60000),
(166024740 , 2147483647))
select tdigest_percentile(t, 0.95) as P95
from (select tdigest(v, c, 100) as t from data) sub;
      p95        
------------------
 29.5167629167667

It's quite close to 26. I would be happy to see this make it into the master branch!

tvondra commented 3 years ago

Yeah, I don't think we can do much better than that - with this type of data (very few distinct values) the accuracy depends a lot on how the centroids get merged, and so on. In fact, it depends a lot on ordering of the input data - try this:

select tdigest_percentile(t, 0.95)
  from (select tdigest(value, count, 100) as t
          from (select * from t order by random()) bar) foo;

I regularly see values between 30 and 40, which is not that great. But the solution is using larger t-digests - using 1000 instead of 100 makes it way more accurate.
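For example, assuming the same table t(value, count) as in the query above, just bumping the compression parameter:

    -- same aggregation, with compression 1000 instead of 100
    select tdigest_percentile(t, 0.95)
      from (select tdigest(value, count, 1000) as t from t) foo;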

Obviously, there might still be bugs affecting the accuracy, but even in that case I think this type of data is problematic.

tvondra commented 3 years ago

BTW the last commit in the branch changes the on-disk format a bit - we need to store the mean instead of the sum for each centroid, to prevent unnecessary rounding errors. Storing the sum causes issues because merging two centroids with the same mean may produce a centroid with a slightly different mean, which may negatively affect the t-digest - e.g. the new mean may sort somewhere else, making the t-digest not well-formed.

The change is made in a backwards-compatible way, i.e. it should be possible to read current t-digests using the new code and get the right results. But maybe that's worth testing.

It also affects the text output, so if some client code works with the t-digest directly (parses the string), that will require changes (using the "flags" value).

It'd be good to test whether this actually works correctly. I'll do that as part of the development, but if you could give it a try on your existing data, that'd be nice.

min-mwei commented 3 years ago

Sounds great, Tomas. Yeah, the metric data range is all over the place, which is what we get when measuring tens of millions of devices in the wild. The approximate percentile via t-digest gives us a sense of direction when the underlying code is changed.

Currently, in my use case, I don't store t-digests on disk just yet, as I am using a specialized hstore to accumulate and merge <value, count> pairs received from an external system. So I don't have a "real world system" to test against, though this API change would let me use the tdigest type in tables now.

tvondra commented 3 years ago

I've pushed this, after a bit of additional polishing.