pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.55k stars 17.89k forks source link

Resampling uses inconsistent labeling for sub-daily and super-daily frequencies #9586

Closed shoyer closed 2 months ago

shoyer commented 9 years ago

xref #2665 xref #5440

Resample appears to be use an inconsistent label convention depending on whether the target frequency is sub-daily/daily or super-daily:

I found this behavior quite surprising and confusing. Is it intentional? I would like to rationalize this if possible, because this strikes me as very poor design. The behavior also couples in a weird way with the closed argument (see the linked issues).

From my perspective (as someone who uses monthly and yearly data), the sub-daily/daily behavior makes sense and the super-daily behavior is a bug: there's no particular reason why it makes sense to use 1 day as an offset for frequencies with super-daily resolution.

CC @Cd48 @kdebrab


Here's my test script:

for orig_freq, target_freq in [('20s', '1min'), ('20min', '1H'), ('10H', '1D'),
                               ('3D', '10D'), ('10D', '1M'), ('1M', 'Q'), ('3M', 'A')]:
    print '%s -> %s:' % (orig_freq, target_freq)
    ind = pd.date_range('2000-01-01', freq=orig_freq, periods=10)
    s = pd.Series(np.arange(10), ind)
    print 'default', s.resample(target_freq, how='first').index[0]
    print 'left', s.resample(target_freq, label='left', how='first').index[0]
    print 'right', s.resample(target_freq, label='right', how='first').index[0]
20s -> 1min:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 00:01:00
20min -> 1H:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-01 01:00:00
10H -> 1D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-02 00:00:00
3D -> 10D:
default 2000-01-01 00:00:00
left 2000-01-01 00:00:00
right 2000-01-11 00:00:00
10D -> 1M:
default 2000-01-31 00:00:00
left 1999-12-31 00:00:00
right 2000-01-31 00:00:00
1M -> Q:
default 2000-03-31 00:00:00
left 1999-12-31 00:00:00
right 2000-03-31 00:00:00
3M -> A:
default 2000-12-31 00:00:00
left 1999-12-31 00:00:00
right 2000-12-31 00:00:00
shoyer commented 9 years ago

OK, after digging more deeply into it.... I discover that M corresponds to the offset "month end", which apparently means the start of the last day of the month. To get "month start", I need to use MS (or likewise, QS or AS).

This is... deeply non-intuitive.

I wish there was a way to change this without breaking a bunch of existing code.

I suppose we could add ME, etc., for month end, but changing the offset M from month-end to month-start seems like a non-starter. Ugh. So I guess we're left with a documentation issue (https://github.com/pydata/pandas/issues/5023), unless we want to add a hack for resample.

jreback commented 9 years ago

IIRC M matches what scikit-timeseries established, a long established pattern. You could change it but that would prob require a long deprecation cycle (IMHO not worth it).

also xref to #9528 as the fill args to .resample are different from up/down resampling. Further I think the docs to resample need much more. my2c.

MarcoGorelli commented 1 year ago

IIRC M matches what scikit-timeseries established, a long established pattern. You could change it but that would prob require a long deprecation cycle (IMHO not worth it).

Given how confusing 'M' is, I think it might actually be worth it

MarcoGorelli commented 1 year ago

Right, time for another @pandas-dev/pandas-core @pandas-dev/pandas-triage tag

Would anyone object to a deprecation cycle to remove 'M' and rename it to 'ME'? Likewise for 'Y' -> 'YE'

It's a really common source of confusion, nobody expects 'M' to mean "month end" (or 'Y' to mean "year end") the first time they see it

Concretely, I'd suggest:

attack68 commented 1 year ago

FWIW in finance the APIs typically use ME for "month end" and M for "monthly" when dealing with scheduling. They might have different contexts but M is never used for month end. I would support the change even if its a long way to go.

datapythonista commented 1 year ago

Not sure if there is any convention followed by other apps, or anything, but seems like an improvement. The current API seems quite confusing indeed. And changing the behavior seems easy if M won't be used by anything else after the change, which would make the change way trickier.

Just to be clear, we're talking about the rule (not freq) parameter of DataFrame.resample (and I guess Series.resample), right? Or am I missing something?

WillAyd commented 1 year ago

I'd be OK with this deprecation. I get bit by this all of the time

MarcoGorelli commented 1 year ago

Just to be clear, we're talking about the rule (not freq)

Yes, that's right, thanks - in resample it's rule and in date_range it's freq, but that's what I meant

jorisvandenbossche commented 1 year ago

Would anyone object to a deprecation cycle to remove 'M' and rename it to 'ME'? Likewise for 'Y' -> 'YE'

You only mention monthly and yearly, but the other frequencies that default to right might have the same issue? (although less used): quarterly (Q), weekly (W), and then the business version of monthly/quarterly/yearly.

MarcoGorelli commented 1 year ago

Thanks - yes, those too

jbrockmendel commented 1 year ago

if M won't be used by anything else after the change

"M" and "Y" would presumably still be used for Period/PeriodDtype.

There are occasional issues with people being surprised that freqs which are valid for date_range are not valid for Period, so it might be helpful to make a cleaner separation between the two concepts. e.g. #38859, #13871, #5091

natmokval commented 1 year ago

I would like to work on this issue. Considering what @MarcoGorelli suggested:

Concretely, I'd suggest:

* pandas 2.1: passing `freq='M'` warns users that `'M'` is deprecated and to use `'MS'` (month start) or `'ME'` (month end) instead

* pandas 3.0: passing `freq='M'` errors, advising users to use `'MS'` (month start) or `'ME'` (month end) instead

I’ll start with warnings while passing freq=‘M’.

AlexKirko commented 1 year ago

When I worked in finance, we would do present value valuations to calculate interest rate risk. These kind of inconsistencies were present in some of the software, and they were a nightmare to detect and circumvent. Securities are quite standartized, so if bonds in a particular country mostly pay the coupon at the start of a period, they do so no matter what the period is.

Fixing this would save a lot of people somewhere a lot of hours.

MarcoGorelli commented 1 year ago

"M" and "Y" would presumably still be used for Period/PeriodDtype.

Are we sure we want this?

There's an example in the docs which shows:

In[366]: p = pd.Period("2014-07", freq="M")

In[367]: p + pd.offsets.MonthEnd(3)
Out[367]: Period('2014-10', 'M')

In[368]: p + pd.offsets.MonthBegin(3)
Traceback
   ...
ValueError: Input has different freq from Period(freq=M)

@jbrockmendel If the prefix were to stay as "M" for Period but "ME" for the offsets, then wouldn't that cause confusion when people try to sum an offset to a Period? Would people be expected to know that MonthEnd is the offset corresponding to the Period 'M'?

I think it might be simpler (and easier to teach) to just use 'MS' and 'ME', without special-casing Period

MarcoGorelli commented 1 year ago

From today's call: seeing as long-term the idea is to decouple Period from Offsets, both p + pd.offsets.MonthEnd(3) and p + pd.offsets.MonthStart(3) would raise, and so it would probably be best to keep 'M' for Period

jbrockmendel commented 1 year ago

From today's call: seeing as long-term the idea is to decouple Period from Offsets

To be clear, this is something id like to see, but have no concrete plans to actually implement. There hasn't been a targeted discussion of the idea, except for the mention of it yesterday and lack of objection.

MarcoGorelli commented 2 months ago

This has been addressed, so I think the issue can be closed

For anyone stumbling across this issue because they're using 'M' in resample, I'd highly suggest using 'MS' instead. I'll be consistent across pandas versions, is a more natural way of thinking about time series, and is very likely what you were actually looking for

Example: