pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

pandas.DataFrame.rolling should accept Timedelta not DateOffset #24900

Open trendelkampschroer opened 5 years ago

trendelkampschroer commented 5 years ago

Code Sample, a copy-pastable example if possible

import pandas as pd
index = pd.date_range("2001-01-01", freq="B", periods=10)  # start date matches the output shown below
x = pd.Series(1., index)
dt = pd.Timedelta("2D")
x.rolling(dt).sum()
2001-01-01    1.0
2001-01-02    2.0
2001-01-03    2.0
2001-01-04    2.0
2001-01-05    2.0
2001-01-08    1.0
2001-01-09    2.0
2001-01-10    2.0
2001-01-11    2.0
2001-01-12    2.0
Freq: B, dtype: float64

x.rolling("B").sum()
...
ValueError: <BusinessDay> is a non-fixed frequency

Problem description

The documentation states that rolling can be used with DateOffset. In fact it can only be used with fixed freq DateOffsets; usage with non-fixed freq DateOffsets will raise. If I understand correctly, all fixed freq DateOffsets can be represented as Timedelta instances? Wouldn't it make sense to allow only Timedelta instead of DateOffset for rolling operations?

Apologies if I am missing a case where a fixed freq DateOffset cannot be expressed as a Timedelta. In any case, the documentation should be more explicit about the admissible DateOffsets.

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit: None
pandas: 0.23.4
pytest: None
pip: 10.0.1
setuptools: 39.2.0
Cython: None
numpy: 1.15.4
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: None
patsy: None
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 3.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: 2.6.2 (dt dec pq3 ext lo64)
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
TomAugspurger commented 5 years ago

If I get it correctly all non-fixed freq DateOffsets can be represented as Timedelta instances?

I don't think so. Timedeltas always represent an absolute, fixed duration. A non-fixed offset like BusinessDay doesn't have a fixed number of nanoseconds.
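A minimal sketch of that distinction, not taken from the thread (the dates are chosen only for illustration): a Timedelta always spans the same duration, while the time covered by one BusinessDay step depends on where it starts.

```python
import pandas as pd

# A Timedelta is an absolute duration: two days is always 172800 seconds.
two_days = pd.Timedelta("2D")
two_days_seconds = two_days.total_seconds()

# A BusinessDay offset follows a calendar rule, so the elapsed time of one
# step depends on the starting timestamp.
bday = pd.offsets.BusinessDay()
thursday = pd.Timestamp("2019-01-03")
friday = pd.Timestamp("2019-01-04")

thursday_step = (thursday + bday) - thursday  # Thursday -> Friday
friday_step = (friday + bday) - friday        # Friday -> Monday (skips the weekend)

print(two_days_seconds, thursday_step, friday_step)
```

Here one business day covers one calendar day from the Thursday but three from the Friday, which is why it has no single Timedelta equivalent.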

trendelkampschroer commented 5 years ago

Yes of course, sorry for the confusion. What I meant to ask was the opposite: Can all fixed freq DateOffsets be represented as Timedelta instances?

Updated my comment above.

TomAugspurger commented 5 years ago

I'm not sure what the issue is then. Timedeltas are accepted in DataFrame.rolling.

On Thu, Jan 24, 2019 at 7:01 AM Benjamin Trendelkamp-Schroer <notifications@github.com> wrote:

Yes of course, sorry for the confusion. What I meant to ask was the opposite: Can all fixed freq DateOffsets be represented as Timedelta instances?


trendelkampschroer commented 5 years ago

Thanks for the quick reply. I should have been more precise: I meant that rolling should not accept DateOffsets, but only int and Timedelta instances (or a str which can be cast to Timedelta).

My intent is twofold: i) I want to understand the difference between a fixed frequency DateOffset and a Timedelta. ii) If they are equivalent (for purposes of rolling operations), then I want to stimulate a discussion that settles whether one is to be preferred over the other for rolling operations. The ideal outcome would be (at least) a comment in the docstring or the examples section of pandas.DataFrame.rolling giving a clear indication of the preferred usage.

The docstring for pandas.DataFrame.rolling says:

window : int, or offset

Size of the moving window. This is the number of observations used for calculating the statistic. Each window will be a fixed size.

If its an offset then this will be the time period of each window. Each window will be a variable sized based on the observations included in the time-period. This is only valid for datetimelike indexes. This is new in 0.19.0
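To illustrate the two window types the docstring describes, here is a small sketch using the business-day series from the code sample at the top of this issue (the variable names are only for illustration). An int window counts observations, while a time-based window counts whatever falls inside the time span.

```python
import pandas as pd

index = pd.date_range("2001-01-01", freq="B", periods=10)
x = pd.Series(1.0, index=index)

# int window: always the last 2 observations, regardless of the gap between them.
# The first value is NaN because fewer than 2 observations are available.
by_count = x.rolling(2).sum()

# time-based window: the observations within the last 48 hours, so the
# number of observations per window varies across the weekend.
by_time = x.rolling(pd.Timedelta(hours=48)).sum()

print(pd.DataFrame({"by_count": by_count, "by_time": by_time}))
```

On Monday 2001-01-08 the int window still sums two observations (Friday and Monday), whereas the 48-hour window contains only Monday.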

This suggests that you can use arbitrary DateOffsets, but in fact only those with a fixed frequency are admissible. But if the only admissible offsets can just as well be represented as a Timedelta, then this should be made clear in the docstring or somewhere in the examples.

This also means that 'offset' might not be the best word to use here, as arbitrary offsets are not permitted.

If there is a rolling operation that can only be performed via DateOffsets and not via Timedeltas, then I'd be eager to learn about it as well.

mroeschke commented 5 years ago

Agreed that the rolling docstring could use clarification.

To answer your questions:

i) Essentially, there is very little difference between fixed frequency offsets (called Ticks internally, though this has not really been exposed in the documentation) and Timedeltas, e.g. pd.offsets.Hour() behaves the same as pd.Timedelta(hours=1) arithmetically. Fixed frequencies exist to behave within the frequency system of pandas.

ii) There is no preference between the two when using the rolling operation.

Overall, we should specify in the docstring that the DateOffset must be fixed-frequency.
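The equivalence in point i) can be checked with a short sketch; the timestamp is arbitrary and only there for illustration.

```python
import pandas as pd

hour_offset = pd.offsets.Hour()     # a fixed frequency offset (a Tick)
hour_delta = pd.Timedelta(hours=1)  # the corresponding Timedelta

# Fixed frequency offsets are instances of the Tick base class.
is_tick = isinstance(hour_offset, pd.offsets.Tick)

# A Tick converts to a Timedelta of the same duration.
same_duration = pd.Timedelta(hour_offset) == hour_delta

# Adding either one to a timestamp gives the same result.
ts = pd.Timestamp("2019-01-01 05:00:00")
same_arithmetic = (ts + hour_offset) == (ts + hour_delta)

print(is_tick, same_duration, same_arithmetic)
```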

trendelkampschroer commented 5 years ago

Thank you for your answer. Furthermore, I'd encourage using Timedelta instead of DateOffset in the docstring.

As far as I understand, a valid Timedelta will always work with rolling operations (for any DatetimeIndex), while a DateOffset may raise if it is not fixed frequency.
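A sketch of that behaviour, reusing the series from the code sample at the top of this issue: a Timedelta and the equivalent fixed frequency offset give the same rolling result, while a BusinessDay offset raises. The exact error message may differ between pandas versions, so only the exception type is checked here.

```python
import pandas as pd

index = pd.date_range("2001-01-01", freq="B", periods=10)
x = pd.Series(1.0, index=index)

# A Timedelta and the equivalent fixed frequency offset produce the same windows.
by_timedelta = x.rolling(pd.Timedelta(hours=48)).sum()
by_tick = x.rolling(pd.offsets.Hour(48)).sum()

# A non-fixed frequency offset is rejected as a rolling window.
try:
    x.rolling(pd.offsets.BusinessDay()).sum()
    bday_raised = False
except ValueError:
    bday_raised = True

print(by_timedelta.equals(by_tick), bday_raised)
```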

I am emphasizing this as it took me some time to realise it. With that understanding, I now find it easier to design code that internally uses rolling operations.

This behaviour of rolling is also in stark contrast to resample, for which a non-fixed freq DateOffset is a valid argument.
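The contrast can be sketched as follows, with a daily series covering January and February 2019 as an assumed example. MonthEnd is a non-fixed frequency offset, since months differ in length.

```python
import pandas as pd

# 59 daily observations: 31 in January 2019 and 28 in February 2019.
index = pd.date_range("2019-01-01", freq="D", periods=59)
x = pd.Series(1.0, index=index)

month_end = pd.offsets.MonthEnd()

# resample accepts the non-fixed offset and bins by calendar month.
monthly = x.resample(month_end).sum()

# rolling rejects the same offset as a window.
try:
    x.rolling(month_end).sum()
    rolling_raised = False
except ValueError:
    rolling_raised = True

print(monthly, rolling_raised)
```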