pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Deprecate groupby() squeeze option #32380

Closed dechamps closed 4 years ago

dechamps commented 4 years ago

Code Sample

import pandas as pd
print(pd.DataFrame([{
    'A': 1,
    'B': 1,
}, {
    'A': 2,
    'B': 2,
}, {
    'A': 2,
    'B': 3,
}]).groupby('A', squeeze=True).count())

Problem description

I expected .groupby(squeeze=True) to, well, squeeze, and count() to return a Series. Instead squeeze=True doesn't seem to do anything, and count() returns a DataFrame.

A workaround is to write .groupby('A').count().squeeze(), which does work.
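
A minimal sketch of that workaround, using the same data as the code sample above. The variable names are illustrative, and passing "columns" to squeeze() is a choice made here so that a result with a single group stays a Series instead of collapsing to a scalar. Selecting the column before aggregating is another way to get the same Series.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 2], "B": [1, 2, 3]})

# count() on the grouped frame returns a DataFrame with a single column, B.
counts = df.groupby("A").count()

# Squeezing along the column axis turns that one-column frame into a Series.
squeezed = counts.squeeze("columns")

# Selecting the column before aggregating yields the same Series directly.
selected = df.groupby("A")["B"].count()

print(squeezed)
```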

Expected Output

A
1    1
2    2
Name: B, dtype: int64

Actual output

   B
A
1  1
2  2

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.6.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-4-amd64
machine : x86_64
processor :
byteorder : little
LC_ALL : None
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 1.0.1
numpy : 1.17.4
pytz : 2019.3
dateutil : 2.8.1
pip : 18.1
setuptools : 44.0.0
Cython : None
pytest : 4.6.9
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.5.0
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.12.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 4.6.9
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
xlsxwriter : None
numba : None

TomAugspurger commented 4 years ago

It doesn't appear to be documented, and I'm not familiar with it. @dechamps are you interested in walking through the code to see what it's intended for?

jreback commented 4 years ago

we should deprecate this option - I don't think the original use cases that I added are worth it

dechamps commented 4 years ago

It doesn't appear to be documented

Well the groupby() squeeze parameter does have documentation, it seems.

Personally I don't care much about the parameter - it's simple enough to just call squeeze() later anyway. However it is of course confusing if the parameter is there but does nothing.

WillAyd commented 4 years ago

+1 to deprecate as well; seems out of place as an argument here

MarcoGorelli commented 4 years ago

However it is of course confusing if the parameter is there but does nothing.

Here's an example of where it does something

In [2]: from pandas import DataFrame

In [3]: df3 = DataFrame(
   ...:     [
   ...:         {"val1": 1, "val2": 20, 'val3': 1},
   ...:         {"val1": 1, "val2": 20, 'val3': 2},
   ...:         {"val1": 1, "val2": 20, 'val3': 3},
   ...:         {"val1": 1, "val2": 20, 'val3': 4},
   ...:     ]
   ...: )

In [4]: df3.set_index(['val1', 'val2']).groupby(['val1', 'val2'], squeeze=True).apply(sum)
Out[4]:
val3    10
dtype: int64

In [5]: df3.set_index(['val1', 'val2']).groupby(['val1', 'val2'], squeeze=False).apply(sum)
Out[5]:
           val3
val1 val2
1    20      10

Anyway, @dechamps, are you interested in submitting a pull request to deprecate it? If so, see https://pandas.pydata.org/docs/development/contributing.html - otherwise, I'd happily take it and see the groupby logic simplified :)

dechamps commented 4 years ago

Honestly I'd prefer you do it - I have zero familiarity with Pandas development workflows.

mlyons-tcc commented 4 years ago

Here's a rant: You deprecated squeeze in 1.1.0 in violation of your own deprecation policy introduced in 1.0.0.

edit: Perhaps it was intended to throw a DeprecationWarning instead of the FutureWarning that was used. FutureWarning indicates that it has already been deprecated and the user is still using it.

jreback commented 4 years ago

we can and will deprecate things in almost every version

the policy is not to remove deprecated features until the next major version, e.g. 2.0

we don't use DeprecationWarning because it's not shown by default and IMHO just useless

FutureWarning is visible

you don't have to change your code and can continue to use it if you would like

mlyons-tcc commented 4 years ago

Great to hear that it is not going away until 2.0! I very much appreciate the deprecation policy you provided, and I took the FutureWarning to mean something else since I expected a DeprecationWarning.

Thanks for providing the rationale of FutureWarning.

Arguments Against Using FutureWarning for Deprecations

As I mentioned, I found the usage of FutureWarning to be ambiguous in terms of what the intentions were. It seemed to me that it could go away at any point in time and that a major release post deprecation had already happened. Or, more scary, that behavior was going to change, since there is "existing use of FutureWarning to warn about constructs that will remain valid code in the future, but will have different semantics" (PEP 565). Otherwise, wouldn't I get a DeprecationWarning?

I think the main cause of confusion is that Python changed its recommendations for DeprecationWarning and FutureWarning in PEP 565, implemented in Python 3.7. Instead of differentiating based on behavior, it now differentiates based on audience: "intended for other Python developers" as opposed to "intended for end users of applications that are written in Python". I think PEP 565 provides clear guidance that the warning pandas raises for deprecations should in fact be a DeprecationWarning.

With regard to visibility, as of Python 3.7 DeprecationWarnings are shown by default only when triggered by code in "__main__". That is great for hiding the warnings from application users. The recommendation is to use a test suite to make them visible. Warning visibility is also controllable in a number of ways, so I definitely would not consider DeprecationWarning useless; it's actually quite proper.
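
As a rough illustration of how that visibility can be controlled, here is a small sketch using only the standard library warnings module. The functions old_api and changing_api are hypothetical stand-ins for library code, not anything from pandas. The filters are set explicitly inside catch_warnings so the result does not depend on the interpreter's default filter configuration.

```python
import warnings


def old_api():
    # Hypothetical library function, used only for illustration.
    warnings.warn("old_api is deprecated", DeprecationWarning, stacklevel=2)
    return 1


def changing_api():
    # Hypothetical library function, used only for illustration.
    warnings.warn("changing_api will change behaviour", FutureWarning, stacklevel=2)
    return 2


def categories_seen(action_for_deprecation):
    """Call both functions and return the names of the warning categories recorded."""
    with warnings.catch_warnings(record=True) as caught:
        # Record everything, then override the action for DeprecationWarning only.
        warnings.simplefilter("always")
        warnings.simplefilter(action_for_deprecation, category=DeprecationWarning)
        old_api()
        changing_api()
    return [w.category.__name__ for w in caught]


shown = categories_seen("always")
hidden = categories_seen("ignore")
print(shown)
print(hidden)
```

The later simplefilter call takes precedence because new filters are inserted at the front of the filter list, so the category-specific rule is consulted before the catch-all one.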

Argument in Support of FutureWarning

Unfortunately, none of the REPLs make DeprecationWarnings generated from modules visible, as far as I can tell. IPython even went so far as to say they want to hide deprecation warnings from modules because some of their dependencies produce a bunch :eyeroll:. Many in the scientific computing landscape depend solely on Jupyter, so it would be unfortunate if they never got visibility into these warnings as they code and execute in a notebook environment. Because of that, as much as I truly believe these should be DeprecationWarnings, I see why they need to be FutureWarnings if the priority is to surface them to a community of users who are not first and foremost software devs.

Final Thought

Using FutureWarning does cause a problem in that someone using pandas to create applications cannot easily ignore pandas deprecation warnings intended for developers without also ignoring warnings intended for the user of the application. If/when Jupyter decides to start surfacing DeprecationWarnings by default, I think it would be a good time to change the type of warning generated in pandas.
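
One partial mitigation, sketched here with the standard library only, is to filter by message pattern as well as category, so that a specific deprecation is silenced while other FutureWarnings still surface. The emit helper and both message texts are made up for this sketch; the real wording of a library's warning would need to be checked before writing such a filter.

```python
import warnings


def emit(message):
    # Stand-in for a library call; the message texts are made up for this sketch.
    warnings.warn(message, FutureWarning, stacklevel=2)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    # Ignore only FutureWarnings whose message matches this regular expression.
    warnings.filterwarnings("ignore", message=r".*squeeze", category=FutureWarning)
    emit("The squeeze parameter is deprecated")  # suppressed by the filter
    emit("Behaviour of frobnicate will change")  # still recorded

remaining = [str(w.message) for w in caught]
print(remaining)
```

The message pattern is matched against the start of the warning text, which is why the sketch uses a leading `.*`.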

jreback commented 4 years ago

we are unlikely to change the warning type as visibility is most important

AlexanderNenninger commented 2 years ago

Hi,

I need exactly this behavior when applying functions to the GroupBy. Is there guidance on alternatives?

brandonrwin commented 1 year ago

Hi,

I need exactly this behavior when applying functions to the GroupBy. Is there guidance on alternatives?

I believe it's

df3.set_index(['val1', 'val2']).groupby(['val1', 'val2']).apply(sum).squeeze()

DataFrameGroupBy.apply() returns a DataFrame, and you use DataFrame.squeeze() on that.
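
A hedged sketch of how this plays out on the df3 example from earlier in the thread. Two substitutions are made here for illustration: the groups are selected with level= because the keys are index levels, and .sum() stands in for .apply(sum). With a single group and a single column the result is a 1x1 frame, so a bare squeeze() collapses it to a scalar. Squeezing only the row axis keeps a Series indexed by column name, which looks closer to the squeeze=True output shown above.

```python
import pandas as pd

df3 = pd.DataFrame(
    {"val1": [1, 1, 1, 1], "val2": [20, 20, 20, 20], "val3": [1, 2, 3, 4]}
)

# One group (1, 20) and one value column, so the aggregated frame is 1x1.
grouped_sum = df3.set_index(["val1", "val2"]).groupby(level=["val1", "val2"]).sum()

# Squeezing every length-1 axis leaves a scalar.
as_scalar = grouped_sum.squeeze()

# Squeezing only the row axis leaves a Series indexed by column name.
as_series = grouped_sum.squeeze(axis=0)

print(as_scalar)
print(as_series)
```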