pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

ENH: Add DatetimeIndexResampler.nlargest #17791

Open edschofield opened 7 years ago

edschofield commented 7 years ago

Code Sample, a copy-pastable example if possible

With this setup:

import numpy as np
import pandas as pd
n = 1000
dates = pd.date_range(start='2010-01-01', periods=n)
rain_random = pd.Series(data=np.random.uniform(size=n), index=dates)

these two operations give different results:

rain_random.groupby(rain_random.index.year).nlargest(3)
rain_random.resample('A').nlargest(3)

Problem description

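The Series.resample().nlargest() operation is inconsistent with DataFrame.resample()[column].nlargest() and the groupby equivalent. It emits a FutureWarning and implicitly takes the mean of each bin, rather than returning the largest values within each bin.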

Output:

/Users/schofield/miniconda/envs/py36/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: 
.resample() is now a deferred operation
You called nlargest(...) on this deferred object which materialized it into a series
by implicitly taking the mean.  Use .resample(...).mean() instead
  """Entry point for launching an IPython kernel.
Out[427]:
2010-12-31    0.507550
2012-12-31    0.490082
2011-12-31    0.478356
dtype: float64
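
As a rough illustration of what the warning describes, the three values above look like per-year means that were then sorted, not per-year maxima. The sketch below reproduces that behaviour explicitly. It assumes a seeded generator (so the numbers will not match the output above), and it picks the year-end alias at runtime because newer pandas releases spell it "YE" where older ones use "A".

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset

# Assumption: choose whichever year-end alias this pandas version accepts.
try:
    to_offset("YE")
    freq = "YE"
except ValueError:
    freq = "A"

rng = np.random.default_rng(0)  # seeded so the run is repeatable
n = 1000
dates = pd.date_range(start="2010-01-01", periods=n)
rain_random = pd.Series(rng.uniform(size=n), index=dates)

# What the deferred resampler effectively did: one mean per year ...
yearly_means = rain_random.resample(freq).mean()

# ... followed by nlargest over those yearly means, not within each year.
top_means = yearly_means.nlargest(3)
print(top_means)
```

With only three years in the data, nlargest(3) simply reorders the three yearly means, which is why the output has three rows with out-of-order dates.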

Expected output:

Date        Date      
1930-12-31  1930-10-06      288.135370
            1930-10-05      285.587734
            1930-10-07      259.439935
            1930-10-08      227.587389
            1930-10-09      190.054844
1931-12-31  1931-01-26     3052.104566
            1931-01-25     2839.126102
            1931-01-29     2196.167129
            1931-02-01     1953.331709
            1931-01-27     1893.975328
1932-12-31  1932-01-19     9526.953864
            1932-01-20     4278.291105
            1932-03-03     2952.348903
            1932-03-02     2946.385433
            1932-03-04     2098.108897

pd.show_versions() output:

INSTALLED VERSIONS
------------------
commit: None
python: 3.6.1.final.0
python-bits: 64
OS: Darwin
OS-release: 16.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_AU.UTF-8
LOCALE: en_AU.UTF-8
pandas: 0.20.1
pytest: 3.0.7
pip: 9.0.1
setuptools: 27.2.0
Cython: 0.25.2
numpy: 1.12.1
scipy: 0.19.0
xarray: None
IPython: 5.3.0
sphinx: 1.6.3
patsy: 0.4.1
dateutil: 2.6.0
pytz: 2017.2
blosc: None
bottleneck: 1.2.1
tables: 3.4.2
numexpr: 2.6.2
feather: None
matplotlib: 2.0.2
openpyxl: 2.4.7
xlrd: 1.0.0
xlwt: 1.2.0
xlsxwriter: 0.9.6
lxml: 3.7.3
bs4: 4.6.0
html5lib: 0.999
sqlalchemy: 1.1.9
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: 0.5.0

jreback commented 7 years ago

nlargest is not a first-class operation on resample, so you need to go through apply:

In [4]: rain_random.resample('A').apply(lambda x: x.nlargest(3))
Out[4]: 
2010-12-31  2010-09-24    0.998530
            2010-04-27    0.997371
            2010-03-09    0.996582
2011-12-31  2011-11-30    0.999936
            2011-02-20    0.997470
            2011-01-17    0.992270
2012-12-31  2012-07-23    0.999762
            2012-06-20    0.998130
            2012-02-25    0.998010
dtype: float64
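
For comparison, here is a minimal sketch of another route to the same shape of result: grouping with pd.Grouper, which bins by frequency the way resample does, and then calling the groupby nlargest. This is an illustration and not a proposed fix for the issue. The helper name year_end_alias is hypothetical, and the data is seeded, so the values differ from the output above.

```python
import numpy as np
import pandas as pd
from pandas.tseries.frequencies import to_offset


def year_end_alias():
    """Return the year-end alias this pandas version accepts (hypothetical helper).

    Newer pandas releases spell it "YE"; older ones use "A".
    """
    try:
        to_offset("YE")
        return "YE"
    except ValueError:
        return "A"


rng = np.random.default_rng(0)  # seeded so the run is repeatable
n = 1000
dates = pd.date_range(start="2010-01-01", periods=n)
rain_random = pd.Series(rng.uniform(size=n), index=dates)

# Group into year-end bins, as resample would, then take the top 3 per bin.
# The result has a two-level index: (bin label, original date).
top3_by_bin = rain_random.groupby(pd.Grouper(freq=year_end_alias())).nlargest(3)

# The same selection keyed by calendar year instead of a bin timestamp.
top3_by_year = rain_random.groupby(rain_random.index.year).nlargest(3)

print(top3_by_bin)
```

The two results hold the same values; only the outer index level differs (year-end timestamps versus integer years).
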
discort commented 6 years ago

@jreback @sinhrks The bug is not reproducible now:

In [14]: rain_random.resample('A').nlargest(3)
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-14-ec29cc197ee8> in <module>()
----> 1 rain_random.resample('A').nlargest(3)

/Users/discort/python/fun/pandas/pandas/core/resample.py in __getattr__(self, attr)
     96             return self[attr]
     97
---> 98         return object.__getattribute__(self, attr)
     99
    100     def __iter__(self):

AttributeError: 'DatetimeIndexResampler' object has no attribute 'nlargest'

Version:

INSTALLED VERSIONS
------------------
commit: 9122952d8c202854a2f48f2b52830839c10ee486
python: 3.5.3.candidate.1
python-bits: 64
OS: Darwin
OS-release: 16.7.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: en_US.UTF-8
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.24.0.dev0+510.g9122952d8
pytest: 3.7.2
pip: 18.0
setuptools: 33.1.1
Cython: 0.28.4
numpy: 1.12.0
scipy: None
pyarrow: None
xarray: None
IPython: 5.2.2
sphinx: 1.6.6
patsy: None
dateutil: 2.6.0
pytz: 2016.10
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.0
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 0.9999999
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.9.5
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

hellojinwoo commented 4 years ago

I am using pandas version 1.0.1 and rain_random.resample('A').nlargest(3) is still not working. I hope this function is added in a future update.

AttributeError                            Traceback (most recent call last)
<ipython-input-102-7057e1436432> in <module>
----> 1 rain_random.resample('A').nlargest(3)

~\anaconda3\lib\site-packages\pandas\core\resample.py in __getattr__(self, attr)
    105             return self[attr]
    106 
--> 107         return object.__getattribute__(self, attr)
    108 
    109     def __iter__(self):

AttributeError: 'DatetimeIndexResampler' object has no attribute 'nlargest'
jreback commented 4 years ago

pull requests are accepted; this is how issues get addressed in open source

hellojinwoo commented 4 years ago

May I ask what you mean by "this is how issues get addressed in open source"?

jreback commented 4 years ago

pandas and virtually all open source projects are volunteer-run

the core team will review pull requests

since there are 3000+ open issues, most patches must come from the community

issues get fixed when folks like you open pull requests

hellojinwoo commented 4 years ago

Yeah, I know that pandas is an open-source project. But regarding this issue, resample('D').nlargest(3), I can see neither assignees nor linked pull requests, which would normally appear on the right side of this webpage. So I was curious to know what you meant by "pull requests are accepted".

And since this issue was raised about two and a half years ago, I just wanted to point out that it has not been resolved yet. So it puzzled me a little when you said "this is how issues get addressed in open source".

jreback commented 4 years ago

there are no assignees (who would we assign?)

and PRs would be linked to the issue

that’s the point here - no one has submitted anything

you or anyone else are welcome to do so

in this or any other issue

noting that something is not done is not that helpful - the issue is marked open

what IS helpful is submitting changes / examples / tests

hellojinwoo commented 4 years ago

Now I can see why your replies have been sour. I am new to this pandas-dev zone, so if it is rude to report an old issue once again, I would like to apologize. You don't need to be sulky like that, though, because that IS NOT helpful either, right? Good day.

jreback commented 4 years ago

@hellojinwoo thanks for the apology

we have been getting the "why has this x-year-old issue not been resolved" comment many times

and to be honest it’s very rude of folks to do this, but i guess new folks just don’t realize this, so ok

people work extremely hard in open source and volunteer much time - yet continued comments like this (and to be clear i am not calling you out at all) cause burnout in what is a thankless task

so thank you for commenting in the issue. as i said above - if you would like to help out, great