Open kuraga opened 4 years ago
Not an answer so much as an empirical observation: using a context manager for file handling (which I think is generally recommended style) seems to solve this problem:
df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})
writer1 = pd.ExcelWriter('first.xlsx', engine='openpyxl')
df.to_excel(writer1)
writer1.save()
writer1.close()
with open('second.xlsx', 'wb') as output2:
writer2 = pd.ExcelWriter(output2, engine='openpyxl')
df.to_excel(writer2)
writer2.save()
@lordgrenville , thanks! It even works in my case (it's complicated than this issue's code).
Then, I'll add your example, too and 'll mark this as a bug. If it's not, people'll say :)
writer2.save()
writer2.close()
Remove one line. It will start to work!
I'm running into the same problem, or at least it seems very similar.
In my case, I'm trying to write a dataframe to an existing worksheet, using ExcelWriter
's append mode. When opening the resulting file excel-results.xlsx
, Excel (Office 365) warns me that it is corrupt and offers to repair. The repair does work, but of course, it shouldn't be necessary.
Some code to reproduce the problem:
import pandas as pd
from pathlib import Path
import shutil
from openpyxl import load_workbook
xlsx_template = Path("excel-template.xlsx")
xlsx_results = Path("excel-results.xlsx")
shutil.copy2(xlsx_template, xlsx_results)
df = pd.DataFrame(
{"type": ["ERROR", "NOTFOUND", "ERROR"],
"message": ["First error message", "Didn't find a value", "Another error"]}
)
with pd.ExcelWriter(xlsx_results, engine="openpyxl", mode="a") as writer:
writer.book = load_workbook(xlsx_results)
writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
df.to_excel(writer, sheet_name="Error messages", startrow=7, startcol=1, index=False, header=False)
writer.save()
The excel-template.xlsx
file used here is this one.
Versions used to reproduce this:
INSTALLED VERSIONS
------------------
commit : 2cb96529396d93b46abab7bbc73a208e708c642e
python : 3.7.10.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.17763
machine : AMD64
processor : Intel64 Family 6 Model 158 Stepping 10, GenuineIntel
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : None.None
pandas : 1.2.4
numpy : 1.20.2
pytz : 2021.1
dateutil : 2.8.1
pip : 21.1.1
setuptools : 49.6.0.post20210108
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : None
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : 3.0.7
pandas_gbq : None
pyarrow : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : None
Adding some flavour to the issue:
The behaviour is identical with what was already reported, but with the difference that Excel recovery cannot restore the records. This leaves a discrepancy of ~2K records between the dataframe and the Excel file created.
DF: [7271 rows x 20 columns] Excel: 5152 rows
There is no filtering, straight with pd.ExcelWriter(temp_name, engine='xlsxwriter') as writer: df_result_set.to_excel(writer, sheet_name='Dataset', index=False) writer.save() writer.close()
The recovery message from Excel indicates functions have been removed, yet all the DF columns are text data, no Excel or udf functions referenced at all.
Recovery log from Excel:
Package Version
2to3 1.0 amqp 5.0.6 argh 0.26.2 backcall 0.2.0 beautifulsoup4 4.9.3 beautifulsoup4-helpers 0.0.1 billiard 3.6.4.0 blinker 1.4 celery 5.1.2 certifi 2021.5.30 charset-normalizer 2.0.4 click 7.1.2 click-didyoumean 0.0.3 click-plugins 1.1.1 click-repl 0.2.0 colorama 0.4.4 dateparser 1.0.0 debugpy 1.4.1 decorator 5.0.9 docopt 0.6.2 et-xmlfile 1.1.0 Flask 2.0.1 greenlet 1.1.1 idna 3.2 install 1.3.4 ipykernel 6.0.3 ipython 7.26.0 ipython-genutils 0.2.0 itsdangerous 2.0.1 jedi 0.18.0 Jinja2 3.0.1 joblib 1.0.1 jupyter-client 6.1.12 jupyter-core 4.7.1 kit 0.2.15 kombu 5.1.0 langdetect 1.0.9 language 0.6 MarkupSafe 2.0.1 matplotlib-inline 0.1.2 natural 0.2.0 nltk 3.6.2 numpy 1.21.1 openpyxl 3.0.7 pandas 1.3.1 parso 0.8.2 pickleshare 0.7.5 pip 21.2.4 prompt-toolkit 3.0.19 PyDispatcher 2.0.5 Pygments 2.9.0 python-dateutil 2.8.2 pytz 2021.1 pywin32 301 PyYAML 5.4.1 pyzmq 22.2.1 regex 2021.8.3 requests 2.26.0 selenium 3.141.0 setuptools 56.0.0 six 1.16.0 soupsieve 2.2.1 SQLAlchemy 1.4.22 tool 0.8.0 tornado 6.1 tqdm 4.62.0 traitlets 5.0.5 tzlocal 2.1 urllib3 1.26.6 vine 5.0.0 wcwidth 0.2.5 Werkzeug 2.0.1 wheel 0.36.2 XlsxWriter 3.0.1
[x] I have checked that this issue has not already been reported.
[x] I have confirmed this bug exists on the latest version of pandas.
[ ] (optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
first.py
:second.py
:UPD 2:
Remove one line. It will start to work!
UPD:
third.py
(see https://github.com/pandas-dev/pandas/issues/33746#issuecomment-640169769 by @lordgrenville):Problem description
~(Note: 9474 = 2 * 4737. But sometimes it's not true.)~
path - str - Path to xls or xlsx file.
So doespd.ExcelWriter.__init__
accept a file-like object?Output of
pd.show_versions()