pola-rs / polars

Dataframes powered by a multithreaded, vectorized query engine, written in Rust
https://docs.pola.rs
Other
27.89k stars 1.71k forks source link

CoGrouped Apply #2235

Open mhconradt opened 2 years ago

mhconradt commented 2 years ago

Describe your feature request

Co-grouped apply is a hybrid of join, groupby, and apply. It can be conceptualized as a full outer join on the group key between lists of rows, plus applying a UDF to the two lists of rows. There are some applications such as labelling data where it's useful to operate on all of the data that shares some key from two DataFrames. This is currently supported in PySpark.

Co-grouped apply would support only regular groupby, not groupby_dynamic. cogroup would be a method on GroupBy that accepts another GroupBy as a parameter.

cars.groupby("manufacturer").cogroup(boats.groupby("manufacturer"))

cogroup would return a CoGroupBy, which would have only one method in its public API: apply. CoGroupBy.apply would accept one parameter: a UDF with two DataFrame objects as its arguments. This UDF should be called with two DataFrames for each key in the union of group keys from the two GroupBy objects. If a key exists in only the "left" DataFrame, the UDF should be called with the data from the left and an empty DataFrame with the same schema as the right.

Full example (my actual use case):

import polars as pl

market_data = get_market_data()
features = get_features(market_data)

def label(features: pl.DataFrame, bars: pl.DataFrame) -> pl.DataFrame:
    # Label each example as a buy, sell or no-op based on subsequent price movements.
    return features.with_column(compute_label(features, bars).alias('label'))

examples = features.groupby('market').cogroup(market_data.groupby('market')).apply(label)
ritchie46 commented 2 years ago

I will take a look at this one a bit later.

NowanIlfideme commented 3 weeks ago

After 1.0 release, this would be something nice to have. It's especially useful if you have a large object in one of the dataframes you are co-grouping, such as parameters of a model you want to apply on a sub-group of your data. Currently, to use it in a vectorized fashion, you would need to broadcast the model parameter values for every subset, or wrap it in partial functions.