pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
29.49k stars 1.87k forks source link

Add streaming implementation of `cut` and `qcut` #19038

Closed McToel closed 2 hours ago

McToel commented 2 hours ago

Description

For grouping large datasets on continues data types, binning functions like cut and qcut are essential. However, these are not supported in streaming mode, so currently no aggregation of bins is possible for out of memory datasets.

A little toy example like this results in the following

import polars as pl

lf = pl.LazyFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})

lf = (
    lf.with_columns(pl.col('a').cut([1, 3, 5]).alias('bins'))
    .group_by('bins')
    .agg(pl.col('b').mean().alias('mean'))
)

print(lf.explain(streaming=True))
AGGREGATE
        [col("b").mean().alias("mean")] BY [col("bins")] FROM
   WITH_COLUMNS:
   [col("a").cut().alias("bins")] 
    STREAMING:
      DF ["a", "b"]; PROJECT 2/2 COLUMNS; SELECTION: None

As far as I can tell, cut should work with streaming data without big changes, but qcut needs to see all data in order to extract correct breaks. If this is welcome I could contribute myself, but I do not understand how the streaming API is implemented, so I would need some help.

ritchie46 commented 2 hours ago

We will eventually try to make all operations streaming, however having an issue for every operation isn't useful and will cluster the issue board. Therefore I will close this one.