Open talegari opened 2 years ago
(text of what was discussed with Ashish)
Here are a few improvements that can be made. Consider a window function like `cumsum` or an agg function like `median` that we intend to apply per group on some column. Right now, we do `df.groupby(...).apply(whatever)`, where `...` are the grouping vars. This is not efficient, as we are chunking the entire dataframe for one column and we are not utilizing `pandas.Series.groupby(...).cumsum()` (see).
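As a rough illustration of the difference, here is a minimal sketch on a toy frame (the data and the variable names `via_apply` and `via_series` are made up for this example). Both calls produce the same per-group cumulative sums, but the second one only ever groups the single column it needs.

```python
import pandas as pd

df = pd.DataFrame({
    "species": ["a", "b", "a", "b", "a"],
    "sepal_length": [1.0, 2.0, 3.0, 4.0, 5.0],
    "petal_width": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Dataframe groupby: each group is split out as a full sub-dataframe,
# even though only one column is needed.
via_apply = (
    df.groupby("species", group_keys=False)
      .apply(lambda g: g["sepal_length"].cumsum())
)

# Series groupby: only the one column is grouped. The keys are passed as a
# Series because the column on its own carries no 'species' label.
via_series = df["sepal_length"].groupby(df["species"]).cumsum()
```

The results are aligned on the original index, so `via_series` can be assigned straight back as a new column.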
This is what we can offer when someone uses a 'tidy' interface for `mutate`:

- `mutate` should gain an extra argument, say `use_series_groupby` (`True`/`False`). `iris_tidy.mutate({'out': (lambda x: x.cumsum(), 'sepal_length')}, by = ['species'])` should be handled via `iris['out'] = iris['sepal_length'].groupby(iris['species']).cumsum()` when `use_series_groupby` is `True`. This should internally handle the case when the lambda function is an agg function. This can be done by checking the length of the result and then joining back to the original dataframe.
- When `use_series_groupby` is `False`, or if the second member of the mutate tuple is a list with size > 1 (like `iris.mutate({'out': ("x / np.sum(y)", ['sepal_length', 'petal_width'])}, by = ['species'])`), we cannot avoid using dataframe groupby. But we can implement this efficiently by creating a df with the required columns only, in this case `species` (grouping var), `sepal_length`, and `petal_width`. Then perform `df.groupby(...).apply(whatever)` as usual and join back to the original df with many columns. This workflow might slow down simple df operations, but it will be faster and use much less memory on larger dataframes, especially those with a lot of columns and many small groups.

@grahitr Add if you have anything I missed. I will create multiple issues out of this and implement them.
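The two paths in the list above could be sketched roughly as follows. `grouped_mutate` is a hypothetical helper written for this illustration, not the library's `mutate`. Two assumptions are baked in: the fast path uses `transform` to broadcast agg results (instead of an explicit length check followed by a join), and the fallback takes a callable on the group's sub-dataframe rather than a string expression like `"x / np.sum(y)"`.

```python
import numpy as np
import pandas as pd


def grouped_mutate(df, out, func, cols, by, use_series_groupby=True):
    """Hypothetical helper sketching the two paths; not the library's mutate."""
    cols = [cols] if isinstance(cols, str) else list(cols)
    by = list(by)
    if use_series_groupby and len(cols) == 1:
        # Fast path: group a single Series by the key columns.
        # transform() keeps window results as they are and broadcasts a
        # scalar (agg) result to every row of its group.
        keys = [df[k] for k in by]
        res = df[cols[0]].groupby(keys).transform(func)
    else:
        # Fallback: dataframe groupby on a narrow frame that holds only the
        # grouping and input columns. func receives each group's
        # sub-dataframe and is assumed to return a Series aligned to that
        # group's index.
        narrow = df[by + cols]
        res = narrow.groupby(by, group_keys=False).apply(func)
    # "Join back" by index alignment; assumes df has a unique index.
    return df.assign(**{out: res})


iris = pd.DataFrame({
    "species": ["a", "b", "a", "b", "a"],
    "sepal_length": [1.0, 2.0, 3.0, 6.0, 7.0],
    "petal_width": [0.1, 0.2, 0.3, 0.4, 0.5],
})

# Window function on one column: Series groupby path.
window = grouped_mutate(iris, "out", lambda x: x.cumsum(),
                        "sepal_length", ["species"])

# Agg function on one column: one value per group, broadcast to its rows.
agg = grouped_mutate(iris, "out", lambda x: x.median(),
                     "sepal_length", ["species"])

# Two input columns: dataframe groupby on the narrow frame.
ratio = grouped_mutate(
    iris, "out",
    lambda g: g["sepal_length"] / np.sum(g["petal_width"]),
    ["sepal_length", "petal_width"], ["species"],
)
```

In the fallback, only `species`, `sepal_length`, and `petal_width` are chunked per group, so any other columns of a wide frame are never copied into the sub-dataframes.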
and avert it in as many cases as possible