pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.22k stars 17.78k forks source link

BUG: Inconsistent behaviour between excel engines when writing multiple times using the same ExcelWriter #40289

Open mrob95 opened 3 years ago

mrob95 commented 3 years ago

Code Sample, a copy-pastable example

import pandas as pd
import io

df = pd.DataFrame({"A": [0, 1], "B": [10, 11]})

for engine in ["openpyxl", "xlsxwriter", "odf"]:
    excel = io.BytesIO()
    with pd.ExcelWriter(excel, engine=engine) as writer:
        df.to_excel(writer, sheet_name="foo", columns=["A"], index=False)
        df.to_excel(writer, sheet_name="foo", columns=["B"], startcol=1, index=False)

    result = pd.read_excel(excel, "foo")
    print(engine)
    print(result)
    print()

Produces:

openpyxl
   A   B
0  0  10
1  1  11

xlsxwriter
   A   B
0  0  10
1  1  11

odf
     A Unnamed: 1
0  0.0        NaN
1  1.0        NaN
2  NaN          B
3  NaN         10
4  NaN         11

Adding this test to test_writers.py (line 1327) will also catch the issue:


    def test_excel_multiple_writes_one_writer(self, path):
        df = DataFrame({"A": [0, 1], "B": [10, 11]})

        with ExcelWriter(path) as writer:
            df.to_excel(writer, columns=["A"], index=False)
            df.to_excel(writer, columns=["B"], startcol=1, index=False)

        result = pd.read_excel(path)
        tm.assert_frame_equal(result, df)

Problem description

This came up when implementing #40231, the behaviour between the engines here seems to be inconsistent.

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : 6baec67643923ae93b685bdec01105b51f1c0a94 python : 3.7.3.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19041 machine : AMD64 processor : Intel64 Family 6 Model 60 Stepping 3, GenuineIntel byteorder : little LC_ALL : None LANG : en_US.UTF-8 LOCALE : None.None pandas : 1.3.0.dev0+979.g6baec67643.dirty numpy : 1.20.0 pytz : 2019.1 dateutil : 2.8.1 pip : 20.3.3 setuptools : 41.4.0 Cython : 0.29.21 pytest : 6.2.2 hypothesis : 6.1.1 sphinx : 2.4.0 blosc : None feather : 0.4.0 xlsxwriter : 1.3.7 lxml.etree : 4.4.1 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.10.1 IPython : 7.5.0 pandas_datareader: None bs4 : 4.8.0 bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : 3.1.0 numexpr : 2.7.1 odfpy : None openpyxl : 3.0.6 pandas_gbq : None pyarrow : 0.13.0 pyxlsb : None s3fs : None scipy : 1.3.0 sqlalchemy : 1.3.13 tables : 3.6.1 tabulate : None xarray : None xlrd : 2.0.1 xlwt : 1.3.0 numba : None
lithomas1 commented 3 years ago

I can confirm the above and this seems like an issue with the odf writer(possibly due to the lack of support for append mode), as I tested writing the data to a file and opening it in Excel(I don't have Apache OpenOffice) and got this. image