
Support `APPROX_PERCENTILE` #11131

Open lmatz opened 1 year ago

lmatz commented 1 year ago

APPROX_PERCENTILE is supported in Presto, SingleStore, Redis, Snowflake, Vertica, Databricks, TiDB ........ It is supposed to run much faster than its exact equivalents, i.e. PERCENTILE_CONT and PERCENTILE_DISC, which RW already supports, so it is recommended when the dataset is large.

It is typically implemented with the t-digest algorithm. There is an existing Rust implementation at https://github.com/MnO2/t-digest, and a modified version in https://github.com/apache/arrow-datafusion/blob/main/datafusion/physical-expr/src/aggregate/tdigest.rs.

In essence, quoting the results from its paper:

The new algorithm described here, the t-digest, provides constant memory bounds and constant relative accuracy while operating in a strictly on-line fashion.

This is used in one user's PoC. PERCENTILE_CONT and PERCENTILE_DISC can serve as a workaround, but the data size is said to be really large.

fuyufjh commented 1 year ago

Is it possible to maintain it incrementally? We have approx_count_distinct(), but it seems a little heavy to maintain the state.

lmatz commented 1 year ago

Is it possible to maintain it incrementally? We have approx_count_distinct(), but it seems a little heavy to maintain the state.

Per my understanding, it is cheap to maintain. The main logic is this part: https://github.com/MnO2/t-digest/blob/master/src/lib.rs#L226-L306

Correct me if I am wrong; to summarize:

  1. We already have a t-digest (from the last computation). The size of the t-digest can be freely configured: the larger, the more accurate. For example, if we set the size to 100, the t-digest contains 100 centroids. The major task of the algorithm is to cluster the data into these 100 centroids.
  2. N new values come from upstream. We first sort the N values, and then run a merge-sort-like procedure over the sorted values and the existing centroids, i.e. O(N log N + N + 100), to find the new 100 centroids. The details of how the centroids are chosen are more involved.

And for the streaming version, I think we can additionally provide a function interface that does not specify the percentile number, because maintaining the 100 centroids is independent of the percentile requested in the query.

e.g. `create materialized view mv as select approx_percentile(column1) as p1 from t`, and only in the batch query do we specify the percentile, as in `select approx_percentile(p1, 0.95) from mv`. This way it is more general, and I think there is no extra cost.
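
Spelled out as a full sketch (the fraction-less `approx_percentile(column1)` signature is the proposal here, not an existing function):

```sql
-- Sketch of the proposed two-part usage; the fraction-less signature
-- approx_percentile(column1) is hypothetical.

-- Streaming side: maintain only the t-digest state (e.g. 100 centroids),
-- independent of which percentile will later be requested.
CREATE MATERIALIZED VIEW mv AS
SELECT approx_percentile(column1) AS p1
FROM t;

-- Batch side: the fraction is supplied only at query time.
SELECT approx_percentile(p1, 0.95) FROM mv;
SELECT approx_percentile(p1, 0.99) FROM mv;  -- same state, different fraction
```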

lmatz commented 1 year ago

e.g. `create materialized view mv as select approx_percentile(column1) as p1 from t`, and only in the batch query do we specify the percentile, as in `select approx_percentile(p1, 0.95) from mv`. This way it is more general, and I think there is no extra cost.

I think it also applies to the two exact equivalents, doesn't it?

PERCENTILE_CONT and PERCENTILE_DISC are going to keep all the data anyway, so the percentile, e.g. 90% or 99%, can be determined when we run the batch query.
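
For illustration, since the exact versions keep every row as state anyway, the raw values could live in the materialized view and the fraction be chosen per batch query (a minimal sketch using the standard ordered-set aggregate syntax):

```sql
-- The exact aggregates need all rows as state, so keep the raw values in
-- the MV and pick the fraction per batch query.
CREATE MATERIALIZED VIEW mv AS SELECT column1 FROM t;

SELECT percentile_cont(0.90) WITHIN GROUP (ORDER BY column1) FROM mv;
SELECT percentile_disc(0.99) WITHIN GROUP (ORDER BY column1) FROM mv;
```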

lmatz commented 1 year ago

There is another way of doing approximate percentiles: https://materialize.com/docs/transform-data/patterns/percentiles/. It can be computed without adding special statistics functions, but it requires a cross self-join.
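
Roughly, that pattern looks like the following (a minimal sketch of the idea with illustrative decade buckets, not the exact SQL from the Materialize docs):

```sql
-- Histogram of decade buckets (illustrative; log() is base 10 here).
CREATE MATERIALIZED VIEW histogram AS
SELECT power(10.0, ceil(log(value))) AS bucket, count(*) AS frequency
FROM t
WHERE value > 0
GROUP BY 1;

-- The cross self-join: cumulative frequency up to each bucket.
CREATE MATERIALIZED VIEW cumulative AS
SELECT h.bucket, sum(g.frequency) AS cum_freq
FROM histogram h, histogram g
WHERE g.bucket <= h.bucket
GROUP BY h.bucket;

-- Approximate p95: the smallest bucket covering 95% of all rows.
SELECT min(bucket)
FROM cumulative
WHERE cum_freq >= 0.95 * (SELECT sum(frequency) FROM histogram);
```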

kwannoel commented 1 year ago

This is an implementation prerequisite for the streaming version: https://github.com/risingwavelabs/risingwave/issues/11909.

Additionally we have the following constraints:

Alternatively, we can also support it being computed ad-hoc in batch mode.

lmatz commented 1 year ago

Do we consider the HDR version?

kwannoel commented 1 year ago

Do we consider the HDR version?

It has potential to be pretty fast if we sacrifice precision, and I guess it works "as is".

Let me just recap on the HDR version:

  1. HDR basically drops the precision of the input.
  2. The lower the precision, the fewer buckets we have to maintain.
  3. We have to do a streaming nested-loop join for each incoming record, but it is bounded by the number of buckets we have. The reason is that we have to maintain a cumulative frequency count per bucket (see the sketch below).
  4. ~Some part of the query needs to be run in batch, but it's cheap to execute.~ I rewrote it to use dynamic filter.
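
A minimal sketch of the bucketing-plus-cumulative-count idea in plain SQL (illustrative only, keeping 2 significant digits; not the exact queries from the demo PR):

```sql
-- HDR-style bucket = (decimal exponent, mantissa truncated to 2 digits).
-- log() is base 10 here; negative values would need a sign column too.
CREATE MATERIALIZED VIEW hdr_buckets AS
SELECT
  floor(log(value))                                 AS exponent,
  floor(value / power(10.0, floor(log(value)) - 1)) AS mantissa,
  count(*)                                          AS frequency
FROM t
WHERE value > 0
GROUP BY 1, 2;
-- e.g. value = 1234: exponent = 3, mantissa = 12, i.e. the bucket [1200, 1300).

-- Cumulative frequency per bucket; the join is bounded by the bucket count.
CREATE MATERIALIZED VIEW hdr_cumulative AS
SELECT a.exponent, a.mantissa, sum(b.frequency) AS cum_freq
FROM hdr_buckets a, hdr_buckets b
WHERE b.exponent < a.exponent
   OR (b.exponent = a.exponent AND b.mantissa <= a.mantissa)
GROUP BY a.exponent, a.mantissa;
```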

With a precision of 2, I ingested 1M rows in under 3 minutes.

But with any higher precision (3 onwards), it drops to a crawl. We probably have to manage large states as well.

Perhaps we can let the user try it first and see if it satisfies their requirements.

Demo https://github.com/risingwavelabs/risingwave/pull/11853

It works in streaming.


Separate notes on the implementation:

kwannoel commented 1 year ago

A much faster approach is demonstrated here: https://github.com/risingwavelabs/risingwave/pull/11931.

What's left is to wrap it as a series of SQL functions / a single SQL function for usability.

Three options in decreasing generality, each with different trade-offs, are stated here: https://github.com/risingwavelabs/risingwave/blob/729ed66ff01acf0b93960ad2e5db3b78f3494175/hdr_approx_in_one.sql

kwannoel commented 11 months ago

Not urgently required anymore, but the functionality is described above: https://github.com/risingwavelabs/risingwave/blob/729ed66ff01acf0b93960ad2e5db3b78f3494175/hdr_approx_in_one.sql

liias commented 10 months ago

Sorry if off-topic; I just wanted to also reference UDDSketch (paper) here, which is based on DDSketch (paper).

More on UDDSketch vs t-digest at https://docs.timescale.com/use-timescale/latest/hyperfunctions/percentile-approx/advanced-agg/ (no affiliation with me).

I don't have experience using any of those, but I would be an interested consumer.

TennyZhuang commented 9 months ago

I also took a look at Timescale. I found that UDDSketch can support two-phase aggregation. Does our current implementation support that?

lmatz commented 9 months ago

Does our current implementation support that?

I think so. Our current hdr_approx version (https://github.com/risingwavelabs/risingwave/blob/729ed66ff01acf0b93960ad2e5db3b78f3494175/hdr_approx_in_one.sql) does not require any new or special functions; only max, sum, and count are used. Therefore, any aggregation in it can be done by two-phase aggregation, but whether it brings better performance is another story.
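
For illustration, each of those aggregates decomposes cleanly into a partial and a final phase (sketch with a hypothetical `shard` column):

```sql
-- Why max/sum/count are two-phase friendly (hypothetical shard column):
-- phase 1: partial aggregates, computed independently per shard.
CREATE MATERIALIZED VIEW partials AS
SELECT shard, max(v) AS pmax, sum(v) AS psum, count(v) AS pcnt
FROM t
GROUP BY shard;

-- phase 2: combine the partials; note that counts are combined with sum().
SELECT max(pmax), sum(psum), sum(pcnt) FROM partials;
```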

kwannoel commented 9 months ago

Something to add: we currently use HDR (High Dynamic Range) histograms for the approximate percentile rather than t-digest. The reason is that the t-digest algorithm works only for append-only streams, whereas HDR supports updatable streams, and we can already use that approach (albeit with some user-facing complexity, since the user needs to compose various queries together to achieve it).

t-digest and UDDSketch could indeed be supported as well, as an optimization.

fuyufjh commented 6 months ago

I just learned the HDR Histogram algorithm from https://github.com/risingwavelabs/risingwave/pull/11931/files, and this method doesn't seem to be any better than a naive histogram built with exponential buckets.

The key to making it "HDR" (or high-precision) is using more buckets. In the example of #11931, it uses 6 (exponent) * 100 (mantissa) = 600 buckets in total. With this many buckets, exponential buckets will achieve better results. Why better? Because exponential bucketing divides the whole sample space more evenly than HDR's approach does.

Illustration of HDR Histogram's bucket:
||||||  |  |  |  |         |         |         |         |    
----------------------------------------------------------->
0    10          100                                    1000

Illustration of naive histogram's bucket:
||| | |  |  |   |   |    |    |     |      |       |      |  
----------------------------------------------------------->
0    10          100                                    1000

(To make it more readable, I didn't follow the correct ratio, so the real cases are more severe than this illustration.)

If we decide to support streaming APPROX_PERCENTILE, I would prefer the naive histogram instead of HDR histogram.
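
For concreteness, the naive exponential bucketing amounts to a single expression per value (a sketch, not taken from #11931; a ratio of 1.02 per bucket gives roughly 1% relative error):

```sql
-- Naive exponential histogram: bucket i covers [1.02^i, 1.02^(i+1)).
CREATE MATERIALIZED VIEW exp_histogram AS
SELECT floor(ln(value) / ln(1.02)) AS bucket, count(*) AS frequency
FROM t
WHERE value > 0
GROUP BY 1;
-- A bucket id maps back to a representative value via power(1.02, bucket).
```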

Besides, shall we also update our example #11931 to use naive histogram?

lmatz commented 6 months ago

The principal idea is a mix of exponential and linear histograms, to support a dynamic range with a precision appropriate to the unit scale.

Setting the precision to 0 makes the naive histogram a special case of the HDR histogram, i.e. it gets rid of the linear-histogram part.
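
In SQL terms (a sketch, continuing the illustrative bucketing above): with precision 0 the mantissa disappears and only the exponent remains, which is exactly an exponential histogram with ratio 10:

```sql
-- HDR with precision 0: each bucket is just the decimal exponent,
-- i.e. a pure exponential histogram with a ratio of 10 per bucket.
SELECT floor(log(value)) AS bucket, count(*) AS frequency
FROM t
WHERE value > 0
GROUP BY 1;
```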

it uses 6 (exponent) * 100 (mantissa) = 600 buckets in total.

If it is referring to the numbers here: https://github.com/risingwavelabs/risingwave/pull/11931/files#diff-518c64863b11adf8976a4d4cd78cc3b4129650a2990530526fb21d9e9e55f98cR66-R75, it seems to be 4 (4 ranges are inserted).

Both can be offered, e.g. under different names. I saw that Datadog created a new algorithm, DDSketch, to replace the HDR histogram, and compared the new one only against the HDR histogram in its blog: https://www.datadoghq.com/blog/engineering/computing-accurate-percentiles-with-ddsketch/. I assume Datadog knows this area well and treated the HDR histogram as the state of the art before coming up with DDSketch.

fuyufjh commented 5 months ago

@lmatz Today a user asked for this in the community channel. IIRC, this requirement has already been raised by users multiple times. Shall we work it out? cc @kwannoel. We can reassign it to someone else if you don't have bandwidth.

kwannoel commented 5 months ago

@lmatz Today a user asked for this in the community channel. IIRC, this requirement has already been raised by users multiple times. Shall we work it out? cc @kwannoel. We can reassign it to someone else if you don't have bandwidth.

Hi @fuyufjh, could you find someone else to work on it first if it's urgent? I don't have bandwidth for this at the moment.

kwannoel commented 5 months ago

Related: https://github.com/risingwavelabs/risingwave/issues/12857

github-actions[bot] commented 2 months ago

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean. Don't worry if you think the issue is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

kwannoel commented 1 month ago

Tracking: https://github.com/risingwavelabs/risingwave/issues/17531