pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.81k stars 17.98k forks source link

BUG: to_excel inserts invalid data if a level in a MultiIndex is None #51252

Open musshorn opened 1 year ago

musshorn commented 1 year ago

Pandas version checks

Reproducible Example

import pandas as pd
import numpy as np

Cols = pd.MultiIndex.from_tuples([("Item", "Type 1", "Class A"), ("Item", "Type 1", "Class B"), ("Item", "Type 2", "Class A"), ("Item", "Type 2", "Class B"), ("Item", "Type 3", None)])

df = pd.DataFrame(np.random.randn(10, 5), columns=Cols)
print(df)
df.to_excel("Test.xlsx")

Issue Description

When a level in a MultiIndex is None, the resulting dataframe when exported to excel has an invalid level inserted replacing the None. See Image. PandasIssue

Expected Behavior

I would expect the result from to_excel should reflect the levels in the MultiIndex correctly.

Installed Versions

>>> pd.show_versions() INSTALLED VERSIONS ------------------ commit : 2e218d10984e9919f0296931d92ea851c6a6faf5 python : 3.10.0.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19044 machine : AMD64 processor : Intel64 Family 6 Model 79 Stepping 1, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : English_Australia.1252 pandas : 1.5.3 numpy : 1.22.4 pytz : 2022.2.1 dateutil : 2.8.2 setuptools : 57.4.0 pip : 23.0 Cython : 0.29.32 pytest : 7.2.0 hypothesis : None sphinx : 4.5.0 blosc : None feather : None xlsxwriter : 3.0.1 lxml.etree : 4.9.1 html5lib : None pymysql : None psycopg2 : None jinja2 : 3.1.2 IPython : 8.1.1 pandas_datareader: None bs4 : 4.10.0 bottleneck : None brotli : 1.0.9 fastparquet : None fsspec : None gcsfs : None matplotlib : 3.5.1 numba : None numexpr : 2.8.4 odfpy : None openpyxl : 3.1.0 pandas_gbq : None pyarrow : None pyreadstat : None pyxlsb : None s3fs : None scipy : 1.8.0 snappy : None sqlalchemy : None tables : 3.7.0 tabulate : 0.8.9 xarray : None xlrd : None xlwt : 1.3.0 zstandard : 0.17.0 tzdata : None
phofl commented 1 year ago

I guess this is related to forward filling for round tripping purposes (see read_excel index_col documentation). But investigations welcome

musshorn commented 1 year ago

I think https://github.com/pandas-dev/pandas/blob/13db83ad3f1245107151e6a626537a8ea7f72df1/pandas/io/formats/excel.py#L641-L660 is the offending loop. When it reaches the last level, levels.take(level_codes) gets passed [0, 1, 0, 1, -1] which sets values to ['Class A', 'Class B', 'Class A', 'Class B', 'Class B'] and that's ultimately the row that gets written to Excel.

g-kisenkov commented 1 year ago

https://github.com/pandas-dev/pandas/blob/b9a4335b8a85276182e6fd7e6eae360eda6cc6ad/pandas/core/indexes/base.py#L1052

This variable is set to False for your case. When I manually set it to True in my debugger I get the expected Excel output.

Now it's left to figure out why it's set to False automatically...

Can I take it?

musshorn commented 1 year ago

Yeah all yours, I was just trying to provide some starting research for whoever took it up

g-kisenkov commented 1 year ago

take