pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Support for approximate quantiles #4878

Open OneRaynyDay opened 1 year ago

OneRaynyDay commented 1 year ago

Problem Description

Polars has quantiles for pl.Expr, pl.LazyFrame and more. I think it would benefit polars to have a faster quantile aggregation method that sacrifices (a bit of) accuracy for speed. Common streaming query engines like Trino have approx quantile implemented but not quantile because the datasets would otherwise OOM. There are typically two ways to do it: t-digest or q-digest.

In a nutshell:

- **q-digest**: Represents the quantile intervals as a complete binary tree of depth N. Quantile error is bounded by the discrete number of intervals the points can fall into, with uniform bounds across all quantiles.
- **t-digest**: A modern successor to q-digest that uses a clustering algorithm with a cost function heavily favoring extreme quantiles, so it is more accurate than q-digest for quantiles very close to 0 or 1 (which is what people typically want anyway). It typically exposes a compression parameter to tune the clustering. Unfortunately, it is usually less deterministic than q-digest, but if we partition a sorted stream and combine the clusters in the same way, the result should be deterministic.
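To make the t-digest idea concrete, here is a toy single-pass sketch in Python. The helper names (`build_tdigest`, `estimate_quantile`) and the greedy chunking are illustrative assumptions, not the real t-digest algorithm or any polars API; real implementations use a scale function and incremental merging. It does show the key property: centroid capacity follows the size bound `4*n*q*(1-q)/delta`, so centroids near q = 0 and q = 1 stay tiny (accurate tails) while centroids near the median grow large (high compression).

```python
# Toy t-digest-style sketch (hypothetical helpers, NOT the real algorithm
# or the polars API): group sorted values into (mean, weight) centroids
# whose maximum weight follows the t-digest size bound 4*n*q*(1-q)/delta.

def build_tdigest(values, delta=100.0):
    """Return a list of (mean, weight) centroids for the given values."""
    values = sorted(values)
    n = len(values)
    centroids, i = [], 0
    while i < n:
        q = (i + 1) / n                              # quantile position of this centroid
        cap = max(1, int(4 * n * q * (1 - q) / delta))  # size bound: tiny at the tails
        chunk = values[i:i + cap]
        centroids.append((sum(chunk) / len(chunk), len(chunk)))
        i += len(chunk)
    return centroids

def estimate_quantile(centroids, q):
    """Crude estimate: the mean of the centroid covering rank q."""
    total = sum(w for _, w in centroids)
    target, cum = q * total, 0.0
    for mean, w in centroids:
        if cum + w >= target:
            return mean
        cum += w
    return centroids[-1][0]
```

On 1,000 uniform values this compresses to a few hundred centroids, with single-point centroids (exact answers) at the extreme quantiles and wide centroids (small absolute error) near the median.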

chitralverma commented 1 year ago

@OneRaynyDay, I have the code ready for this implementation, but polars already uses the quickselect algorithm, which is significantly faster than the t-digest or DDSketch implementations.

The reason for this is that insertions into probabilistic data structures take time.

The only gain I see from adding this implementation is that we get a constant-space algorithm.
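For reference, the quickselect approach mentioned above finds an exact quantile in expected O(n) time without sorting. This is only a sketch of the idea, not polars's actual Rust implementation; this functional version also uses O(n) scratch space per level for clarity, whereas an in-place three-way partition keeps extra space constant.

```python
import random

def quickselect(xs, k):
    """Return the k-th smallest element of xs (0-indexed), expected O(n).

    Sketch of the quickselect idea only -- not polars's Rust code. The list
    comprehensions trade the usual in-place partition for readability.
    """
    pivot = random.choice(xs)
    lows = [x for x in xs if x < pivot]
    pivots = [x for x in xs if x == pivot]
    if k < len(lows):
        return quickselect(lows, k)
    if k < len(lows) + len(pivots):
        return pivot
    highs = [x for x in xs if x > pivot]
    return quickselect(highs, k - len(lows) - len(pivots))
```

An exact median of `n` values is then `quickselect(values, n // 2)`: one linear pass in expectation, versus maintaining a sketch on every insertion.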

mjclarke94 commented 3 months ago

Just came across this, and there are a few motivations beyond space efficiency for using t-digests that would be useful in some of our use cases.

T-digests can be combined, which solves the problem of determining the median of a dataset from the medians of subsets of that data, as is common in streaming or distributed-compute scenarios.
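The mergeability property can be sketched as follows. The helper names and the re-compression pass are illustrative assumptions (not a real library API): each partition summarizes its data as `(mean, weight)` centroids, and merging concatenates the centroid lists sorted by mean and re-compresses them under the same t-digest size bound, so a global quantile can be estimated without any node ever holding all the raw data.

```python
# Sketch of t-digest mergeability (hypothetical helpers, not a real API).
# Digests are lists of (mean, weight) centroids; raw values can be wrapped
# as singleton centroids [(v, 1.0), ...].

def merge_digests(a, b, delta=100.0):
    """Merge two centroid lists and re-compress under the size bound."""
    cents = sorted(a + b)                        # combined centroids, by mean
    total = sum(w for _, w in cents)
    merged, cum = [], 0.0
    for mean, w in cents:
        q = (cum + w / 2) / total                # quantile position of this centroid
        cap = 4 * total * q * (1 - q) / delta    # t-digest size bound
        if merged and merged[-1][1] + w <= cap:
            m, mw = merged[-1]                   # fold into the previous centroid
            merged[-1] = ((m * mw + mean * w) / (mw + w), mw + w)
        else:
            merged.append((mean, w))
        cum += w
    return merged

def quantile_from_centroids(centroids, q):
    """Crude estimate: the mean of the centroid covering rank q."""
    total = sum(w for _, w in centroids)
    target, cum = q * total, 0.0
    for mean, w in centroids:
        if cum + w >= target:
            return mean
        cum += w
    return centroids[-1][0]
```

Contrast this with exact medians: knowing the median of each partition tells you almost nothing about the global median, but merged digests recover it to within the compression error.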

Additionally, once the t-digest has been computed, subsequent quantile evaluations are extremely quick. In scenarios like optimisation problems where new quantiles may be frequently evaluated, the option to front-load the effort of calculating a probabilistic data structure and then have quicker iterations is valuable.
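The front-load-then-query pattern can be illustrated even without a digest (helper names are hypothetical): pay one O(n log n) sort up front, after which every quantile query is an O(1) index lookup. A prebuilt t-digest gives the same cheap-repeated-query access pattern, just with bounded memory instead of the full array.

```python
# Illustration of "front-load once, query cheaply" using an exact sorted
# array rather than a digest (hypothetical helper names).

def precompute(values):
    """One-time O(n log n) cost."""
    return sorted(values)

def quantile(sorted_vals, q):
    """Nearest-rank quantile from a presorted array: O(1) per query."""
    idx = min(len(sorted_vals) - 1, int(q * len(sorted_vals)))
    return sorted_vals[idx]
```

In an optimisation loop, `precompute` runs once and `quantile` runs per iteration, which is exactly the amortisation the comment describes.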