Open mhconradt opened 2 years ago
I will take a look at this one a bit later.
After 1.0 release, this would be something nice to have. It's especially useful if you have a large object in one of the dataframes you are co-grouping, such as parameters of a model you want to apply on a sub-group of your data. Currently, to use it in a vectorized fashion, you would need to broadcast the model parameter values for every subset, or wrap it in partial functions.
Describe your feature request
Co-grouped apply is a hybrid of join, groupby, and apply. It can be conceptualized as a full outer join on the group key between lists of rows, plus applying a UDF to the two lists of rows. There are some applications such as labelling data where it's useful to operate on all of the data that shares some key from two DataFrames. This is currently supported in PySpark.
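To make the "full outer join on the group key, then apply a UDF to the two lists of rows" idea concrete, here is a minimal pure-Python sketch. It is only an illustration of the semantics, not a proposed implementation: rows are plain dicts, and the names `cogroup_apply` and `count_rows` are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable

Row = dict[str, Any]


def cogroup_apply(
    left: list[Row],
    right: list[Row],
    key: str,
    udf: Callable[[list[Row], list[Row]], list[Row]],
) -> list[Row]:
    """Call `udf` once per key in the union of keys from `left` and `right`."""
    left_groups: dict[Any, list[Row]] = defaultdict(list)
    right_groups: dict[Any, list[Row]] = defaultdict(list)
    for row in left:
        left_groups[row[key]].append(row)
    for row in right:
        right_groups[row[key]].append(row)

    # Union of group keys, in first-seen order (left keys first).
    keys = list(dict.fromkeys([*left_groups, *right_groups]))

    out: list[Row] = []
    for k in keys:
        # A key missing on one side yields an empty list for that side.
        out.extend(udf(left_groups.get(k, []), right_groups.get(k, [])))
    return out


# Hypothetical UDF: report how many rows each side contributed per key.
def count_rows(left_rows: list[Row], right_rows: list[Row]) -> list[Row]:
    group_id = (left_rows or right_rows)[0]["id"]
    return [{"id": group_id, "n_left": len(left_rows), "n_right": len(right_rows)}]


events = [{"id": 1, "x": 10}, {"id": 1, "x": 20}, {"id": 2, "x": 30}]
labels = [{"id": 1, "label": "a"}, {"id": 3, "label": "c"}]
result = cogroup_apply(events, labels, "id", count_rows)
```

Here the UDF runs for keys 1, 2, and 3: key 2 gets an empty right-hand list and key 3 an empty left-hand list, which is what distinguishes this from an inner join followed by a regular grouped apply.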
Co-grouped apply would support only regular `groupby`, not `groupby_dynamic`.

`cogroup` would be a method on `GroupBy` that accepts another `GroupBy` as a parameter. `cogroup` would return a `CoGroupBy`, which would have only one method in its public API: `apply`.

`CoGroupBy.apply` would accept one parameter: a UDF with two `DataFrame` objects as its arguments. This UDF should be called with two `DataFrames` for each key in the union of group keys from the two `GroupBy` objects. If a key exists in only the "left" `DataFrame`, the UDF should be called with the data from the left and an empty `DataFrame` with the same schema as the right.

Full example (my actual use case):