pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs

Iterator over lazy groupby #8966

Open erinov1 opened 1 year ago

erinov1 commented 1 year ago

Problem description

Is there a fundamental obstruction to being able to consume a lazy groupby as an iterator? I often wish I could do something like

# Each `gdf` is a LazyFrame.
pl.concat(
    complicated_function(gdf, keys, other_lazyframe) for keys, gdf in ldf.groupby(by)
)

where complicated_function depends on the specific keys and some other lazyframes, but cannot be expressed in an agg context. (This works fine for eager DataFrames.)

This is of course similar to a groupby-apply, but my understanding is that groupby-apply kills parallelism even when complicated_function involves only native polars functionality.

warpedgeoid commented 9 months ago

I often miss the way that Pandas allows iteration over groups. This request would seem to be a step in the right direction.

leunga1000 commented 1 month ago

A note for people running into the same issue: besides calling .collect() on the lazy frame to turn it into a DataFrame and then grouping that (which is not quite as speedy), you might be able to use partition_by() with the over() window function.