pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License

BUG: reset_index() loses the frequency of a DatetimeIndex #59273

Open annika-rudolph opened 1 month ago

annika-rudolph commented 1 month ago

Reproducible Example

>>> import pandas as pd
>>> index = pd.DatetimeIndex(pd.date_range(start="2000", freq="YS", periods=10), name="Date")
>>> df = pd.DataFrame(data=list(range(10)), index=index)
>>> print(df.index.freq)
<YearBegin: month=1>
>>> print(df.reset_index()['Date']._values.freq)
None
>>> df = df.reset_index().set_index('Date')
>>> print(df.index.freq)
None

Issue Description

Calling reset_index() on a DataFrame with a DatetimeIndex drops the index's frequency. The newly created column is backed by a DatetimeArray, but that array does not carry the freq attribute. As a result, reset_index() followed by set_index() does not restore the original index, which can cause problems for code that relies on index.freq.

Expected Behavior

I would expect that reset_index().set_index() lets me recover the original index :)
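Until the round trip preserves freq, the frequency can be reattached by hand. The following is only a sketch of two possible workarounds, not an official recommendation: it assumes the values are still regularly spaced after the round trip, and the variable names (roundtrip, explicit, inferred) are made up for illustration.

```python
import pandas as pd

index = pd.DatetimeIndex(
    pd.date_range(start="2000", freq="YS", periods=10), name="Date"
)
df = pd.DataFrame(data=list(range(10)), index=index)

# The round trip from the report: the rebuilt index has no freq.
roundtrip = df.reset_index().set_index("Date")

# Option 1: rebuild the index with the known frequency. pandas validates
# the values against it and raises a ValueError if they do not conform.
explicit = roundtrip.copy()
explicit.index = pd.DatetimeIndex(explicit.index, freq=df.index.freq)

# Option 2: let pandas infer the frequency from the values themselves.
inferred = roundtrip.copy()
inferred.index = pd.DatetimeIndex(inferred.index, freq="infer")

print(explicit.index.freq)
print(inferred.index.freq)
```

Option 1 is stricter because it fails loudly when rows were dropped or reordered in between; option 2 is convenient when the original frequency is no longer at hand.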

Installed Versions

INSTALLED VERSIONS

commit : bfe5be01fef4eaecf4ab033e74139b0a3cac4a39
python : 3.10.12
python-bits : 64
OS : Linux
OS-release : 5.15.153.1-microsoft-standard-WSL2
Version : #1 SMP Fri Mar 29 23:14:13 UTC 2024
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : C.UTF-8
LOCALE : en_US.UTF-8

pandas : 0+untagged.34794.gbfe5be0
numpy : 1.26.4
pytz : 2024.1
dateutil : 2.9.0.post0
pip : 22.0.2
Cython : 3.0.10
sphinx : 7.3.7
IPython : 8.23.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.3.8
fastparquet : 2024.2.0
fsspec : 2024.3.1
html5lib : 1.1
hypothesis : 6.100.1
gcsfs : 2024.3.1
jinja2 : 3.1.3
lxml.etree : 5.2.1
matplotlib : 3.8.4
numba : 0.59.1
numexpr : 2.10.0
odfpy : None
openpyxl : 3.1.2
psycopg2 : 2.9.9
pymysql : 1.4.6
pyarrow : 16.0.0
pyreadstat : 1.2.7
pytest : 8.1.1
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.3.1
scipy : 1.13.0
sqlalchemy : 2.0.29
tables : 3.9.2
tabulate : 0.9.0
xarray : 2024.3.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.22.0
tzdata : 2024.1
qtpy : None
pyqt5 : None

aram-cinnamon commented 1 month ago

take

aram-cinnamon commented 1 month ago

I did some digging, and it seems to be intended that freq becomes None in a column: https://github.com/pandas-dev/pandas/blob/18a3eec55523513c5e08fe014646c044cc825fa4/pandas/core/internals/blocks.py#L2158-L2160

The above was added in this PR, https://github.com/pandas-dev/pandas/pull/41425, which mentions that "The long-term behavior is definitely going to always drop the freq (more specifically, DTA/TDA won't have freq, xref https://github.com/pandas-dev/pandas/issues/31218). So this PR standardizes always-dropping."

@annika-rudolph What do you think? Also @jbrockmendel @jreback @mroeschke @jorisvandenbossche you created/reviewed/were mentioned in the PR. What are your thoughts on this issue?
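For readers following along, the difference between an index and a column can be seen without a DataFrame at all. This is a small illustrative sketch, not code from the linked PR; the names index and column are arbitrary, and the exact string returned by pd.infer_freq depends on the pandas version.

```python
import pandas as pd

index = pd.date_range(start="2000", freq="YS", periods=10, name="Date")

# An Index keeps its freq ...
print(index.freq)

# ... but the same values stored as a column (here a Series) do not.
column = pd.Series(index)
print(column.array.freq)

# The regular spacing can still be inferred from the values on demand.
print(pd.infer_freq(column))
```

So the information is not lost in the data itself; only the cached freq attribute is dropped once the values live in a column.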

annika-rudolph commented 1 month ago

Thanks for digging into this! It is what I suspected :)

From a user perspective, I can say that frequencies on a DatetimeIndex are quite important, even more so since some functionality (like business day and resample) will be dropped for PeriodIndex, which for us means that we recently moved everything to DatetimeIndex. Thus, it would be nice if DatetimeIndex could cover the same functionality as PeriodIndex and, specifically, if the freq attribute could be retained in all transformations. reset_index() -> set_index() is a common pattern that I see a lot when working with MultiIndex data, which is also very relevant in many of my projects.

It seems to me that the decision on always dropping the frequency was taken some time ago (before deciding to drop PeriodIndex functionality?), so maybe it could be reconsidered?

yuanx749 commented 1 month ago

I encountered this issue and did some debugging. It is the reshape below that leads to the loss of freq: https://github.com/pandas-dev/pandas/blob/1fa50252e1f411dbd5ee37b45f3ee602b39fd68c/pandas/core/internals/blocks.py#L2313

But as mentioned by @aram-cinnamon, I think this behaviour is expected.