Add streaming implementation of `cut` and `qcut`

Description

For grouping large datasets on continuous data types, binning functions like `cut` and `qcut` are essential. However, they are not supported in streaming mode, so aggregating over bins is currently not possible for larger-than-memory datasets.

A small toy example and the query plan it produces (note that the `cut` step sits outside the `STREAMING` section):

import polars as pl

lf = pl.LazyFrame({'a': [1, 2, 3, 4, 5], 'b': [1, 2, 3, 4, 5]})

lf = (
    lf.with_columns(pl.col('a').cut([1, 3, 5]).alias('bins'))
    .group_by('bins')
    .agg(pl.col('b').mean().alias('mean'))
)

print(lf.explain(streaming=True))

AGGREGATE
        [col("b").mean().alias("mean")] BY [col("bins")] FROM
   WITH_COLUMNS:
   [col("a").cut().alias("bins")] 
    STREAMING:
      DF ["a", "b"]; PROJECT 2/2 COLUMNS; SELECTION: None

As far as I can tell, `cut` should work in streaming mode without big changes: the breaks are fixed up front, so each batch can be binned independently. `qcut` is harder, because it needs to see all the data to compute the correct breaks before any row can be binned. If this is welcome, I could contribute it myself, but I do not understand how the streaming API is implemented, so I would need some help.
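To illustrate the difference, here is a minimal sketch in plain Python. It is not how Polars implements either function; `cut_chunk`, `chunks`, and `qcut_median` are hypothetical helpers, and a single median break stands in for general quantiles. It assumes right-closed bins, matching the default of `cut`.

```python
from bisect import bisect_left
from statistics import median


def cut_chunk(values, breaks):
    """Return a bin index per value for fixed, right-closed breaks.

    Index i means breaks[i - 1] < value <= breaks[i].
    """
    return [bisect_left(breaks, v) for v in values]


def chunks(values, size):
    """Yield consecutive batches, standing in for streamed morsels."""
    for start in range(0, len(values), size):
        yield values[start:start + size]


def qcut_median(values):
    """Toy quantile cut: one break at the median of the values given."""
    return cut_chunk(values, [median(values)])


data = [1, 2, 3, 4, 5]
breaks = [1, 3, 5]

# Fixed breaks: binning batch by batch matches binning everything at once.
cut_whole = cut_chunk(data, breaks)
cut_streamed = [b for chunk in chunks(data, 2) for b in cut_chunk(chunk, breaks)]

# Data-dependent breaks: per-batch quantiles disagree with the global ones.
qcut_whole = qcut_median(data)
qcut_streamed = [b for chunk in chunks(data, 2) for b in qcut_median(chunk)]

print(cut_whole == cut_streamed)    # True
print(qcut_whole == qcut_streamed)  # False
```

With fixed breaks the per-batch result is identical to the full-data result, which is why `cut` looks like a straightforward fit for streaming. A streaming `qcut` would presumably need a first pass (or an approximate quantile sketch) to determine the breaks before binning.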

pola-rs / polars

Add streaming implementation of `cut` and `qcut` #19038
