pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.22k stars 17.78k forks source link

BUG: ExcelWriter: file is corrupted on save (and: does it accept a file object?) #33746

Open kuraga opened 4 years ago

kuraga commented 4 years ago

Code Sample, a copy-pastable example

first.py:

import pandas as pd
import openpyxl

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

writer1 = pd.ExcelWriter('first.xlsx', engine='openpyxl')
df.to_excel(writer1)
writer1.save()
writer1.close()

second.py:

import pandas as pd
import openpyxl

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

output2 = open('second.xlsx', 'wb')
writer2 = pd.ExcelWriter(output2, engine='openpyxl')
df.to_excel(writer2)
writer2.save()
writer2.close()
output2.flush()
output2.close()

UPD 2:

writer2.save()
writer2.close()

Remove one line. It will start to work!

UPD: third.py (see https://github.com/pandas-dev/pandas/issues/33746#issuecomment-640169769 by @lordgrenville):

import pandas as pd
import openpyxl

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

with open('third.xlsx', 'wb') as output3:
    writer3 = pd.ExcelWriter(output3, engine='openpyxl')
    df.to_excel(writer3)
    writer3.save()

Problem description

$ python first.py
$ python second.py
$ du -b first.xlsx second.xlsx third.xlsx
4737    first.xlsx
9474    second.xlsx
4737    third.xlsx

~(Note: 9474 = 2 * 4737. But sometimes it's not true.)~

  1. Why files differ?
  2. Documentation says: path - str - Path to xls or xlsx file. So does pd.ExcelWriter.__init__ accept a file-like object?

Output of pd.show_versions()

INSTALLED VERSIONS ------------------ commit : None python : 3.7.7.final.0 python-bits : 64 OS : Linux OS-release : 5.4.32-calculate machine : x86_64 processor : Intel(R) Core(TM) i5-7200U CPU @ 2.50GHz byteorder : little LC_ALL : None LANG : ru_RU.utf8 LOCALE : ru_RU.UTF-8 pandas : 1.0.3 numpy : 1.18.1 pytz : 2019.3 dateutil : 2.8.1 pip : 20.0.2 setuptools : 46.1.3.post20200330 Cython : 0.29.15 pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : 1.2.8 lxml.etree : 4.5.0 html5lib : None pymysql : None psycopg2 : None jinja2 : 2.11.1 IPython : 7.13.0 pandas_datareader: None bs4 : None bottleneck : None fastparquet : None gcsfs : None lxml.etree : 4.5.0 matplotlib : 3.1.3 numexpr : None odfpy : None openpyxl : 3.0.3 pandas_gbq : None pyarrow : None pytables : None pytest : None pyxlsb : None s3fs : None scipy : 1.4.1 sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None xlsxwriter : 1.2.8 numba : None
lordgrenville commented 4 years ago

Not an answer so much as an empirical observation: using a context manager for file handling (which I think is generally recommended style) seems to solve this problem:

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

writer1 = pd.ExcelWriter('first.xlsx', engine='openpyxl')
df.to_excel(writer1)
writer1.save()
writer1.close()

with open('second.xlsx', 'wb') as output2:
    writer2 = pd.ExcelWriter(output2, engine='openpyxl')
    df.to_excel(writer2)
    writer2.save()
kuraga commented 4 years ago

@lordgrenville , thanks! It even works in my case (it's complicated than this issue's code).

Then, I'll add your example, too and 'll mark this as a bug. If it's not, people'll say :)

kuraga commented 4 years ago
writer2.save()
writer2.close()

Remove one line. It will start to work!

wwwald commented 3 years ago

I'm running into the same problem, or at least it seems very similar.

In my case, I'm trying to write a dataframe to an existing worksheet, using ExcelWriter's append mode. When opening the resulting file excel-results.xlsx, Excel (Office 365) warns me that it is corrupt and offers to repair. The repair does work, but of course, it shouldn't be necessary.

Some code to reproduce the problem:

import pandas as pd
from pathlib import Path
import shutil
from openpyxl import load_workbook

xlsx_template = Path("excel-template.xlsx")
xlsx_results = Path("excel-results.xlsx")
shutil.copy2(xlsx_template, xlsx_results)

df = pd.DataFrame(
    {"type": ["ERROR", "NOTFOUND", "ERROR"], 
     "message": ["First error message", "Didn't find a value", "Another error"]}
)

with pd.ExcelWriter(xlsx_results, engine="openpyxl", mode="a") as writer:
    writer.book = load_workbook(xlsx_results)
    writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
    df.to_excel(writer, sheet_name="Error messages", startrow=7, startcol=1, index=False, header=False)
    writer.save()

The excel-template.xlsx file used here is this one.

Versions used to reproduce this:

INSTALLED VERSIONS  
------------------  
commit           : 2cb96529396d93b46abab7bbc73a208e708c642e  
python           : 3.7.10.final.0  
python-bits      : 64  
OS               : Windows  
OS-release       : 10  
Version          : 10.0.17763  
machine          : AMD64  
processor        : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel  
byteorder        : little  
LC_ALL           : None  
LANG             : en_US.UTF-8  
LOCALE           : None.None  

pandas           : 1.2.4  
numpy            : 1.20.2  
pytz             : 2021.1  
dateutil         : 2.8.1  
pip              : 21.1.1  
setuptools       : 49.6.0.post20210108  
Cython           : None  
pytest           : None  
hypothesis       : None  
sphinx           : None  
blosc            : None  
feather          : None  
xlsxwriter       : None  
lxml.etree       : None  
html5lib         : None  
pymysql          : None  
psycopg2         : None  
jinja2           : None  
IPython          : None  
pandas_datareader: None  
bs4              : None  
bottleneck       : None  
fsspec           : None  
fastparquet      : None  
gcsfs            : None  
matplotlib       : None  
numexpr          : None  
odfpy            : None  
openpyxl         : 3.0.7  
pandas_gbq       : None  
pyarrow          : None  
pyxlsb           : None  
s3fs             : None  
scipy            : None  
sqlalchemy       : None  
tables           : None  
tabulate         : None  
xarray           : None  
xlrd             : None  
xlwt             : None  
numba            : None  
nownc commented 3 years ago

Adding some flavour to the issue:

The behaviour is identical with what was already reported, but with the difference that Excel recovery cannot restore the records. This leaves a discrepancy of ~2K records between the dataframe and the Excel file created.

DF: [7271 rows x 20 columns] Excel: 5152 rows

There is no filtering, straight with pd.ExcelWriter(temp_name, engine='xlsxwriter') as writer: df_result_set.to_excel(writer, sheet_name='Dataset', index=False) writer.save() writer.close()

The recovery message from Excel indicates functions have been removed, yet all the DF columns are text data, no Excel or udf functions referenced at all. Recovery log from Excel: error008280_01.xml

Errors were detected in file 'FY22Q1m.xlsx'Removed Records: Formula from /xl/worksheets/sheet1.xml part_.

Package Version


2to3 1.0 amqp 5.0.6 argh 0.26.2 backcall 0.2.0 beautifulsoup4 4.9.3 beautifulsoup4-helpers 0.0.1 billiard 3.6.4.0 blinker 1.4 celery 5.1.2 certifi 2021.5.30 charset-normalizer 2.0.4 click 7.1.2 click-didyoumean 0.0.3 click-plugins 1.1.1 click-repl 0.2.0 colorama 0.4.4 dateparser 1.0.0 debugpy 1.4.1 decorator 5.0.9 docopt 0.6.2 et-xmlfile 1.1.0 Flask 2.0.1 greenlet 1.1.1 idna 3.2 install 1.3.4 ipykernel 6.0.3 ipython 7.26.0 ipython-genutils 0.2.0 itsdangerous 2.0.1 jedi 0.18.0 Jinja2 3.0.1 joblib 1.0.1 jupyter-client 6.1.12 jupyter-core 4.7.1 kit 0.2.15 kombu 5.1.0 langdetect 1.0.9 language 0.6 MarkupSafe 2.0.1 matplotlib-inline 0.1.2 natural 0.2.0 nltk 3.6.2 numpy 1.21.1 openpyxl 3.0.7 pandas 1.3.1 parso 0.8.2 pickleshare 0.7.5 pip 21.2.4 prompt-toolkit 3.0.19 PyDispatcher 2.0.5 Pygments 2.9.0 python-dateutil 2.8.2 pytz 2021.1 pywin32 301 PyYAML 5.4.1 pyzmq 22.2.1 regex 2021.8.3 requests 2.26.0 selenium 3.141.0 setuptools 56.0.0 six 1.16.0 soupsieve 2.2.1 SQLAlchemy 1.4.22 tool 0.8.0 tornado 6.1 tqdm 4.62.0 traitlets 5.0.5 tzlocal 2.1 urllib3 1.26.6 vine 5.0.0 wcwidth 0.2.5 Werkzeug 2.0.1 wheel 0.36.2 XlsxWriter 3.0.1