pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

Incorrect result with DateOffset.rollback for normalize=True #32616

Open kdebrab opened 4 years ago

kdebrab commented 4 years ago

Code Sample, a copy-pastable example if possible

from pandas.tseries.offsets import MonthBegin
MonthBegin(normalize=True).rollback('2020-4-1 12:34')
Out[2]: Timestamp('2020-03-01 00:00:00')

Problem description

It rolls back to the beginning of the previous month instead of the current month. This happens only when normalize=True and the timestamp falls within the first day of the month. Strangely, the result is correct at midnight:

from pandas.tseries.offsets import MonthBegin
MonthBegin(normalize=True).rollback('2020-4-1 00:00')
Out[3]: Timestamp('2020-04-01 00:00:00')

Expected Output

Out[2]: Timestamp('2020-04-01 00:00:00')

In general, I would expect the following to hold for any timestamp dt:

assert MonthBegin(normalize=True).rollback(dt) == MonthBegin().rollback(dt.normalize())
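
Until the behaviour changes, the right-hand side of that assertion can serve as a workaround: normalize the timestamp first, then roll back with an offset that does not normalize. A minimal sketch, assuming a hypothetical helper name `rollback_normalized` (not part of pandas):

```python
import pandas as pd
from pandas.tseries.offsets import MonthBegin


def rollback_normalized(dt, offset=None):
    """Roll `dt` back to the offset boundary, ignoring its time of day.

    Hypothetical helper: it normalizes the timestamp *before* rolling back,
    instead of relying on the offset's own normalize=True handling.
    """
    ts = pd.Timestamp(dt)
    # Default to a MonthBegin offset constructed without normalize=True.
    offset = MonthBegin() if offset is None else offset
    return offset.rollback(ts.normalize())


# Midday on the first of the month stays in the current month.
print(rollback_normalized("2020-4-1 12:34"))
# A mid-month timestamp rolls back to the first of that month.
print(rollback_normalized("2020-4-15 08:00"))
```

The same helper accepts timezone-aware timestamps, since `Timestamp.normalize()` keeps the timezone.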

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.6.8.final.0
python-bits : 64
OS : Windows
OS-release : 10
machine : AMD64
processor : Intel64 Family 6 Model 78 Stepping 3, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : None.None
pandas : 1.0.1
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 40.6.2
Cython : 0.29.4
pytest : 5.3.5
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : 1.2.8
lxml.etree : 4.5.0
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : 4.6.1
bottleneck : 1.3.1
fastparquet : None
gcsfs : None
lxml.etree : 4.5.0
matplotlib : 3.2.0
numexpr : 2.7.1
odfpy : None
openpyxl : 3.0.3
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.3.5
pyxlsb : None
s3fs : 0.2.1
scipy : 1.4.1
sqlalchemy : 1.3.13
tables : None
tabulate : 0.8.6
xarray : 0.15.0
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.2.8
numba : None

kdebrab commented 4 years ago

See also https://stackoverflow.com/a/56993025/1551810

ThomasA commented 8 months ago

I am seeing this behaviour too, and it is unexpected to me as well. Sadly, more than 3 years later, the bug is still present in pandas 2.1.2. It works as expected at midnight:

In [63]: ts = pd.Timestamp("2023-11-01 00:00:00", tz="UTC")

In [64]: pd.offsets.MonthBegin(normalize=True).rollback(ts)
Out[64]: Timestamp('2023-11-01 00:00:00+0000', tz='UTC')

In [65]: pd.offsets.MonthBegin().rollback(ts)
Out[65]: Timestamp('2023-11-01 00:00:00+0000', tz='UTC')

Unexpected behaviour at other times of the same day:

In [66]: ts = pd.Timestamp("2023-11-01 14:39:00", tz="UTC")

In [67]: pd.offsets.MonthBegin(normalize=True).rollback(ts)
Out[67]: Timestamp('2023-10-01 00:00:00+0000', tz='UTC')

In [68]: pd.offsets.MonthBegin().rollback(ts)
Out[68]: Timestamp('2023-11-01 14:39:00+0000', tz='UTC')

zephyr707 commented 7 months ago

I just got bitten by this bug and do not understand why normalize=True produces such different results at midnight and one second after midnight. I am using pandas 1.2.4; does the latest release, 2.1.3, fix this bug? Thanks to kdebrab for the Stack Overflow link; it looks like a way to work around the bug.
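
For anyone who needs the start of the month for a whole column rather than a single timestamp, one way to sidestep `rollback` entirely is to go through monthly periods. This is only a sketch and not necessarily what the linked answer proposes. `month_start` is a hypothetical helper name, and the example assumes timezone-naive input:

```python
import pandas as pd


def month_start(values):
    """Return midnight on the first day of each value's month.

    Hypothetical helper; assumes timezone-naive datetimes.
    """
    idx = pd.DatetimeIndex(values)
    # Converting to monthly periods drops the day and time of day;
    # to_timestamp() then maps each period back to its start.
    return idx.to_period("M").to_timestamp()


result = month_start(["2020-04-01 12:34", "2020-04-15 08:00", "2020-04-01 00:00"])
print(result)
```

This avoids offsets altogether, so it does not depend on how normalize=True is handled.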