pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.75k stars 17.96k forks source link

BUG: Inconsistent output type in Excel for PeriodIndex in Index or MultiIndex #60099

Open pforero opened 2 weeks ago

pforero commented 2 weeks ago

Pandas version checks

Reproducible Example

import pandas as pd

period_index = pd.period_range(start="2006-10-06", end="2006-10-07", freq="D")
pi_index = pd.Series(0, index=period_index)
pi_multi = pd.Series(0, index=pd.MultiIndex.from_arrays([period_index, [0, 0]]))

pi_index.to_excel("index_as_date_openpyxl.xlsx", engine="openpyxl", index=True)
pi_multi.to_excel("index_as_str_openpyxl.xlsx", engine="openpyxl", index=True)

pi_index.to_excel("index_as_date_xlsxwriter.xlsx", engine="xlsxwriter",  index=True)
pi_multi.to_excel("index_as_str_xlsxwriter.xlsx", engine="xlsxwriter", index=True)

Issue Description

When exporting a Series/DataFrame to excel when the index contains a PeriodIndex the format of the output depends on the index being an Index or a MultiIndex.

When the PeriodIndex is in an Index, the output format in Excel is a date. When the PeriodIndex is a level from a MultiIndex, the output format is a string.

Expected Behavior

The expected behaviour is that the format of the period would not be dependent on it being in an Index or in a MultiIndex. Being either a string or a date in excel is both acceptable as long as it is consistent in both outputs.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 0691c5cf90477d3503834d983f69350f250a6ff7 python : 3.12.4 python-bits : 64 OS : Linux OS-release : 6.1.100+ Version : #1 SMP PREEMPT_DYNAMIC Sat Oct 5 14:28:44 UTC 2024 machine : x86_64 processor : x86_64 byteorder : little LC_ALL : None LANG : None LOCALE : C.UTF-8 pandas : 2.2.3 numpy : 2.0.0 pytz : 2024.1 dateutil : 2.9.0.post0 pip : 24.2 Cython : None sphinx : None IPython : 8.25.0 adbc-driver-postgresql: None adbc-driver-sqlite : None bs4 : None blosc : None bottleneck : None dataframe-api-compat : None fastparquet : None fsspec : None html5lib : None hypothesis : None gcsfs : None jinja2 : None lxml.etree : None matplotlib : None numba : None numexpr : None odfpy : None openpyxl : 3.1.5 pandas_gbq : None psycopg2 : None pymysql : None pyarrow : 16.1.0 pyreadstat : None pytest : 8.2.2 python-calamine : None pyxlsb : None s3fs : None scipy : None sqlalchemy : 2.0.31 tables : None tabulate : None xarray : None xlrd : None xlsxwriter : 3.2.0 zstandard : None tzdata : 2024.1 qtpy : None pyqt5 : None
ZKaoChi commented 2 weeks ago

I think this is worth changing, can I try it ?

rhshadrach commented 1 week ago

Thanks for the report. For a non-MultiIndex, we explicitly convert these to timestamps here:

https://github.com/pandas-dev/pandas/blob/d11ed2f1193c7a45446a702170b8ca0368bc07d3/pandas/io/formats/excel.py#L765-L766

We should do something similar in the MultiIndex case. PRs are welcome!