Closed shoyer closed 2 months ago
OK, after digging more deeply into it.... I discover that M
corresponds to the offset "month end", which apparently means the start of the last day of the month. To get "month start", I need to use MS
(or likewise, QS
or AS
).
This is... deeply non-intuitive.
I wish there was a way to change this without breaking a bunch of existing code.
I suppose we could add ME
, etc., for month end, but changing the offset M
from month-end to month-start seems like a non-starter. Ugh. So I guess we're left with a documentation issue (https://github.com/pydata/pandas/issues/5023), unless we want to add a hack for resample.
IIRC M
matches what scikit-timeseries
established, a long established pattern. You could change it but that would prob require a long deprecation cycle (IMHO not worth it).
also xref to #9528 as the fill args to .resample
are different from up/down resampling. Further I think the docs to resample need much more. my2c.
IIRC M matches what scikit-timeseries established, a long established pattern. You could change it but that would prob require a long deprecation cycle (IMHO not worth it).
Given how confusing 'M'
is, I think it might actually be worth it
Right, time for another @pandas-dev/pandas-core @pandas-dev/pandas-triage tag
Would anyone object to a deprecation cycle to remove 'M'
and rename it to 'ME'
? Likewise for 'Y'
-> 'YE'
It's a really common source of confusion, nobody expects 'M'
to mean "month end" (or 'Y'
to mean "year end") the first time they see it
Concretely, I'd suggest:
freq='M'
warns users that 'M'
is deprecated and to use 'MS'
(month start) or 'ME'
(month end) insteadfreq='M'
errors, advising users to use 'MS'
(month start) or 'ME'
(month end) insteadFWIW in finance the APIs typically use ME for "month end" and M for "monthly" when dealing with scheduling. They might have different contexts but M is never used for month end. I would support the change even if its a long way to go.
Not sure if there is any convention followed by other apps, or anything, but seems like an improvement. The current API seems quite confusing indeed. And changing the behavior seems easy if M
won't be used by anything else after the change, which would make the change way trickier.
Just to be clear, we're talking about the rule
(not freq
) parameter of DataFrame.resample (and I guess Series.resample
), right? Or am I missing something?
I'd be OK with this deprecation. I get bit by this all of the time
Just to be clear, we're talking about the rule (not freq)
Yes, that's right, thanks - in resample
it's rule
and in date_range
it's freq
, but that's what I meant
Would anyone object to a deprecation cycle to remove
'M'
and rename it to'ME'
? Likewise for'Y'
->'YE'
You only mention monthly and yearly, but the other frequencies that default to right might have the same issue? (although less used): quarterly (Q), weekly (W), and then the business version of monthly/quarterly/yearly.
Thanks - yes, those too
if M won't be used by anything else after the change
"M" and "Y" would presumably still be used for Period/PeriodDtype.
There are occasional issues with people being surprised that freq
s which are valid for date_range
are not valid for Period
, so it might be helpful to make a cleaner separation between the two concepts. e.g. #38859, #13871, #5091
I would like to work on this issue. Considering what @MarcoGorelli suggested:
Concretely, I'd suggest:
* pandas 2.1: passing `freq='M'` warns users that `'M'` is deprecated and to use `'MS'` (month start) or `'ME'` (month end) instead * pandas 3.0: passing `freq='M'` errors, advising users to use `'MS'` (month start) or `'ME'` (month end) instead
I’ll start with warnings while passing freq=‘M’.
When I worked in finance, we would do present value valuations to calculate interest rate risk. These kind of inconsistencies were present in some of the software, and they were a nightmare to detect and circumvent. Securities are quite standartized, so if bonds in a particular country mostly pay the coupon at the start of a period, they do so no matter what the period is.
Fixing this would save a lot of people somewhere a lot of hours.
"M" and "Y" would presumably still be used for Period/PeriodDtype.
Are we sure we want this?
There's an example in the docs which shows:
In[366]: p = pd.Period("2014-07", freq="M")
In[367]: p + pd.offsets.MonthEnd(3)
Out[367]: Period('2014-10', 'M')
In[368]: p + pd.offsets.MonthBegin(3)
Traceback
...
ValueError: Input has different freq from Period(freq=M)
@jbrockmendel If the prefix were to stay as "M"
for Period
but "ME"
for the offsets, then wouldn't that cause confusion when people try to sum an offset to a Period
? Would people be expected to know that MonthEnd
is the offset corresponding to the Period
'M'
?
I think it might be simpler (and easier to teach) to just use 'MS'
and 'ME'
, without special-casing Period
From today's call: seeing as long-term the idea is to decouple Period from Offsets, both p + pd.offsets.MonthEnd(3)
and p + pd.offsets.MonthStart(3)
would raise, and so it would probably be best to keep 'M'
for Period
From today's call: seeing as long-term the idea is to decouple Period from Offsets
To be clear, this is something id like to see, but have no concrete plans to actually implement. There hasn't been a targeted discussion of the idea, except for the mention of it yesterday and lack of objection.
This has been addressed, so I think the issue can be closed
For anyone stumbling across this issue because they're using 'M'
in resample
, I'd highly suggest using 'MS' instead. I'll be consistent across pandas versions, is a more natural way of thinking about time series, and is very likely what you were actually looking for
Example:
'6MS'
means that the windows are:
xref #2665 xref #5440
Resample appears to be use an inconsistent label convention depending on whether the target frequency is sub-daily/daily or super-daily:
label='left'
makes labels at the timestamp corresponding to the start of each frequency bin, andlabel='right'
that makes labels at that timestamp plus the frequency (at the timestamp dividing exactly dividing bins).'left'
to'right'
! My guess is that the default was changed here because users were confused bylabel='left'
no longer falling inside the expected interval. (I guess I could checkgit blame
for the details.)I found this behavior quite surprising and confusing. Is it intentional? I would like to rationalize this if possible, because this strikes me as very poor design. The behavior also couples in a weird way with the
closed
argument (see the linked issues).From my perspective (as someone who uses monthly and yearly data), the sub-daily/daily behavior makes sense and the super-daily behavior is a bug: there's no particular reason why it makes sense to use 1 day as an offset for frequencies with super-daily resolution.
CC @Cd48 @kdebrab
Here's my test script: