pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
42.68k stars 17.6k forks source link

ENH: enable skipna on groupby reduction ops #15675

Open jreback opened 7 years ago

jreback commented 7 years ago

Edit[rhshadach]: The following methods do not have a skipna argument in groupby, but do have such an argument on the Series/DataFrame variant.


https://github.com/pandas-dev/pandas/issues/15674

In [19]: import pandas as pd
    ...: import numpy as np
    ...: d = {'l':  ['left', 'right', 'left', 'right', 'left', 'right'],
    ...:      'r': ['right', 'left', 'right', 'left', 'right', 'left'],
    ...:      'v': [-1, 1, -1, 1, -1, np.nan]}
    ...: df = pd.DataFrame(d)
    ...: 

In [20]: df.groupby('l').v.sum()
Out[20]: 
l
left    -3.0
right    2.0
Name: v, dtype: float64

In [21]: df.groupby('l').v.apply(lambda x: x.sum(skipna=False))
Out[21]: 
l
left    -3.0
right    NaN
Name: v, dtype: float64

ideally write [21] as df.groupby('l').v.sum(skipna=False)

mayukh18 commented 7 years ago

Can I take this up?

jreback commented 7 years ago

that would be great @mayukh18 !

mayukh18 commented 7 years ago

@jreback I was thinking about putting a passthru check for specifically the how == 'add' case a level above and after the actual cython implementation. Does this seem like a good idea?

jreback commented 7 years ago

no, you simply need to add this on to _groupby_function near the top. Needs testing for all ops which support skipna (pretty much all numeric ops)

mayukh18 commented 7 years ago

Adding this in _groupby_function will not effect any change for ops like mean, median etc as they are separately implemented. So, do those need the skipna too? Whereas, first and last ops don't seem like requiring skipna but they will be effected by modifying _groupby_function. I am naive so pardon me if I am asking silly stuff.

jreback commented 7 years ago

@mayukh18 yes the implementation is a bit muddled.

methods which are defined with an actual function (e.g. median/mean), this is very easy, just add to the validation function where numeric_only is now; that will pass it thru.

we can still use _groupby_function, what I would do is this.

add the default for skipna (so do this on all but first/last) sum = _groupby_function('sum', 'add', np.sum, skipna=True)

then add skipna=None in the signature for _groupby_function, which will add the arg if its not None to kwargs (so it gets passed).

mullenkamp commented 2 years ago

This is a surprisingly old issue, but this functionality would be really nice and consistent with the non-groupby methods.

vladu commented 2 years ago

Indeed, I believe this is not just a "nice-to-have" any more, but a necessary step before this FutureWarning can be resolved by users:

FutureWarning: Using the level keyword in DataFrame and Series aggregations is deprecated and will be removed in a future version. Use groupby instead. df.sum(level=1) should use df.groupby(level=1).sum().

As long as DataFrame.sum() and DataFrame.groupby().sum() (and other agg functions) have inconsistent APIs, dropping the level kwarg from non-grouped classes isn't really a good step, IMO.

andremcorreia commented 1 month ago

take

tiago-firmino commented 1 month ago

take