tidypyverse / tidypandas

A grammar of data manipulation for pandas inspired by tidyverse
https://tidypyverse.github.io/tidypandas/
MIT License

[enhancement] `mutate` should make minimal use of `pandas.groupby.apply` #21

Open talegari opened 2 years ago

talegari commented 2 years ago

and avoid it in as many cases as possible.

talegari commented 11 months ago

(text of what was discussed with Ashish)

Here are a few improvements we can make. Consider a window function like cumsum or an aggregation function like median that we intend to apply per group on a single column. Right now, we do df.groupby(...).apply(whatever), where ... are the grouping variables. This is inefficient: we chunk the entire dataframe for the sake of one column, and we do not use the Series groupby methods such as pandas.Series.groupby(...).cumsum().
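To make the comparison concrete, here is a minimal sketch on a toy frame (the data and variable names are made up for illustration, not taken from tidypandas internals). It computes the same grouped cumsum both ways, and shows that an aggregation such as median can be broadcast back to the original rows with `transform`:

```python
import pandas as pd

# Toy data, invented for illustration only
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "a"],
    "sepal_length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "petal_width": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# DataFrame groupby + apply: every group is materialised as a full
# sub-dataframe even though only one column is used.
via_apply = (
    df.groupby("species", group_keys=False)
      .apply(lambda g: g["sepal_length"].cumsum())
      .sort_index()
)

# Series groupby: only the needed column is grouped, and cumsum is a
# built-in groupby method, so no Python function runs per group.
via_series = df["sepal_length"].groupby(df["species"]).cumsum()

# Aggregation case: transform broadcasts the per-group median to each row.
median_per_row = df["sepal_length"].groupby(df["species"]).transform("median")
```

Depending on the pandas version, the `apply` variant may also emit a deprecation warning about operating on the grouping columns, which the Series variant sidesteps.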

This is what we can offer when someone uses a 'tidy' interface for mutate:

  1. mutate should gain an extra argument, say use_series_groupby (True/False).
  2. Handle single-column grouped mutates via Series groupby instead of dataframe groupby. For example, iris_tidy.mutate({'out': (lambda x: x.cumsum(), 'sepal_length')}, by = ['species']) should be handled via iris['out'] = iris['sepal_length'].groupby(iris['species']).cumsum() when use_series_groupby is True. This should internally handle the case where the lambda function is an agg function, which can be done by checking the length of the result and then joining back to the original dataframe.
  3. When use_series_groupby is False, or if the second member of the mutate tuple is a list with size > 1 (like iris_tidy.mutate({'out': ("x / np.sum(y)", ['sepal_length', 'petal_width'])}, by = ['species'])), we cannot avoid using dataframe groupby. But we can implement this efficiently by creating a df with the required columns only, in this case: species (grouping var), sepal_length, petal_width. Then perform df.groupby(...).apply(whatever) as usual and join the result back to the original df with its many columns. This workflow might slow down simple operations on small dataframes, but it should be faster and use much less memory on larger dataframes, especially those with many columns and many small groups.
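A rough sketch of how points 2 and 3 could look on a plain pandas dataframe. The helper names `mutate_single` and `mutate_multi` are hypothetical and not part of tidypandas. The sketch also swaps in `transform` for the explicit length check described in point 2, since `transform` broadcasts scalar (agg) results on its own. It assumes the dataframe has a unique index, that the output column does not already exist, and that the multi-column function is window-like (returns a Series per group):

```python
import numpy as np
import pandas as pd


def mutate_single(df, out, func, col, by):
    """Hypothetical helper: grouped mutate on one input column via Series groupby."""
    keys = [df[k] for k in by]
    result = df.copy()
    # transform keeps window-like results aligned and broadcasts scalar results
    result[out] = df[col].groupby(keys).transform(func)
    return result


def mutate_multi(df, out, func, cols, by):
    """Hypothetical helper: grouped mutate on several input columns."""
    narrow = df[by + cols]  # only grouping and input columns get chunked
    pieces = (
        narrow.groupby(by, group_keys=False)[cols]
        .apply(lambda g: func(g).to_frame(out))  # func must return a Series
    )
    return df.join(pieces)  # join back on the original (unique) index


# Toy data, invented for illustration only
df = pd.DataFrame({
    "species": ["a", "a", "b", "b", "a"],
    "sepal_length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "petal_width": [1.0, 1.0, 2.0, 2.0, 2.0],
    "other": list("vwxyz"),  # stands in for the many unrelated columns
})

window = mutate_single(df, "out", lambda x: x.cumsum(), "sepal_length", ["species"])
agg = mutate_single(df, "out", lambda x: x.median(), "sepal_length", ["species"])
multi = mutate_multi(
    df,
    "out",
    lambda g: g["sepal_length"] / np.sum(g["petal_width"]),
    ["sepal_length", "petal_width"],
    ["species"],
)
```

Passing a Python lambda to `transform` still calls it once per group, so a real implementation would probably want to map known functions (cumsum, median, and so on) to the built-in groupby methods where possible.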

@grahitr Please add anything I missed. I will split this into multiple issues and implement them.