BUG: ExcelWriter: file is corrupted on save (and: does it accept a file object?) #33746

Open kuraga opened 4 years ago

kuraga commented 4 years ago

Code Sample, a copy-pastable example

import pandas as pd
import openpyxl

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

writer1 = pd.ExcelWriter('first.xlsx', engine='openpyxl')
df.to_excel(writer1)
writer1.save()  # save() was the API at the time; pandas 2.0+ uses close()

output2 = open('second.xlsx', 'wb')
writer2 = pd.ExcelWriter(output2, engine='openpyxl')
df.to_excel(writer2)
writer2.save()

UPD (see the comment by @lordgrenville):

with open('third.xlsx', 'wb') as output3:
    writer3 = pd.ExcelWriter(output3, engine='openpyxl')
    df.to_excel(writer3)
    writer3.save()

Problem description

$ python
$ du -b first.xlsx second.xlsx third.xlsx
4737    first.xlsx
9474    second.xlsx
4737    third.xlsx

~(Note: 9474 = 2 * 4737. But sometimes it's not true.)~

  1. Why do the files differ?
  2. The documentation says: path - str - Path to xls or xlsx file. So does pd.ExcelWriter.__init__ accept a file-like object at all?
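
The doubled size in the listing above is what one would see if the workbook bytes were written into the same file handle twice. That is only a hypothesis here, not something the thread confirms. Since an .xlsx file is a zip archive, a rough stdlib-only check is to count end-of-central-directory markers in the file. `count_zip_archives` and `make_archive` are hypothetical helper names, and the tiny archive below merely stands in for a real workbook.

```python
import io
import zipfile

# Every zip archive ends with an end-of-central-directory record
# that starts with this 4-byte signature.
EOCD_SIGNATURE = b"PK\x05\x06"


def count_zip_archives(data: bytes) -> int:
    """Rough proxy for how many zip archives were concatenated in `data`."""
    return data.count(EOCD_SIGNATURE)


def make_archive() -> bytes:
    """Build a minimal zip in memory (a stand-in for a saved workbook)."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr("xl/workbook.xml", "<workbook/>")
    return buffer.getvalue()


single = make_archive()
# Assumption: simulates the same workbook being written twice to one handle.
doubled = single + single

print(count_zip_archives(single), count_zip_archives(doubled))
```

On a real file the same function could be applied to `open('second.xlsx', 'rb').read()`. A count above 1 would point towards concatenated archives, although compressed data can contain the signature by chance, so this is a hint and not proof.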

Output of pd.show_versions()

lordgrenville commented 4 years ago

Not an answer so much as an empirical observation: using a context manager for file handling (which I think is generally recommended style) seems to solve this problem:

df = pd.DataFrame(data={'col1': [1, 2], 'col2': [3, 4]})

writer1 = pd.ExcelWriter('first.xlsx', engine='openpyxl')
df.to_excel(writer1)
writer1.save()

with open('second.xlsx', 'wb') as output2:
    writer2 = pd.ExcelWriter(output2, engine='openpyxl')
    df.to_excel(writer2)
    writer2.save()

kuraga commented 4 years ago

@lordgrenville, thanks! It even works in my case (which is more complicated than this issue's code).

I'll add your example too, and I'll mark this as a bug. If it's not one, people will say so :)

wwwald commented 3 years ago

I'm running into the same problem, or at least it seems very similar.

In my case, I'm trying to write a dataframe to an existing worksheet, using ExcelWriter's append mode. When opening the resulting file excel-results.xlsx, Excel (Office 365) warns me that it is corrupt and offers to repair. The repair does work, but of course, it shouldn't be necessary.

Some code to reproduce the problem:

import pandas as pd
from pathlib import Path
import shutil
from openpyxl import load_workbook

xlsx_template = Path("excel-template.xlsx")
xlsx_results = Path("excel-results.xlsx")
shutil.copy2(xlsx_template, xlsx_results)

df = pd.DataFrame(
    {"type": ["ERROR", "NOTFOUND", "ERROR"],
     "message": ["First error message", "Didn't find a value", "Another error"]}
)

with pd.ExcelWriter(xlsx_results, engine="openpyxl", mode="a") as writer:
    # Reuse the existing workbook and its sheets so to_excel writes into them
    writer.book = load_workbook(xlsx_results)
    writer.sheets = {ws.title: ws for ws in writer.book.worksheets}
    df.to_excel(writer, sheet_name="Error messages", startrow=7, startcol=1, index=False, header=False)

The excel-template.xlsx file used here is this one.
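
One mechanism that can produce a file Excel wants to repair, offered here as an assumption and not a confirmed diagnosis of this report: when an existing file is opened for update and rewritten from offset 0, any bytes beyond the end of the new content survive unless the handle is truncated. The stdlib-only sketch below shows that effect on a throwaway file. `overwrite_in_place` is a hypothetical helper, and no pandas code is involved.

```python
import os
import tempfile


def overwrite_in_place(path, payload, truncate):
    """Rewrite a file from offset 0 through an update-mode handle."""
    with open(path, "r+b") as handle:
        handle.seek(0)
        handle.write(payload)
        if truncate:
            # Drop whatever the old, longer content left past the new end.
            handle.truncate()
    return os.path.getsize(path)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "demo.bin")

    # 100 bytes of old content, then 60 bytes of new content, no truncate.
    with open(path, "wb") as handle:
        handle.write(b"A" * 100)
    size_without_truncate = overwrite_in_place(path, b"B" * 60, truncate=False)

    # Same again, this time truncating after the write.
    with open(path, "wb") as handle:
        handle.write(b"A" * 100)
    size_with_truncate = overwrite_in_place(path, b"B" * 60, truncate=True)

print(size_without_truncate, size_with_truncate)
```

Without the truncate the file keeps its old length, so the tail of the previous content trails the new data. For a zip-based format such as .xlsx, stale trailing bytes of that kind could plausibly confuse a reader.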

Versions used to reproduce this:

nownc commented 3 years ago

Adding some flavour to the issue:

The behaviour is identical to what was already reported, with the difference that Excel's recovery cannot restore the records. This leaves a discrepancy of ~2K records between the dataframe and the Excel file created.

DF: [7271 rows x 20 columns]
Excel: 5152 rows

There is no filtering, just a straight write:

with pd.ExcelWriter(temp_name, engine='xlsxwriter') as writer:
    df_result_set.to_excel(writer, sheet_name='Dataset', index=False)
    writer.close()

The recovery message from Excel indicates that formulas have been removed, yet all the DF columns are text data, with no Excel functions or UDFs referenced at all. Recovery log from Excel (error008280_01.xml):

Errors were detected in file 'FY22Q1m.xlsx'
Removed Records: Formula from /xl/worksheets/sheet1.xml part
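
A possible explanation worth checking, stated as an assumption and not as a diagnosis: XlsxWriter by default converts strings that begin with `=` into formulas (its `strings_to_formulas` workbook option), so text data containing such values can end up as formula cells that Excel then rejects. The sketch below only scans plain Python rows for such values. `formula_like_cells` and the sample rows are hypothetical.

```python
def formula_like_cells(rows):
    """Return (row_index, column, value) for strings that start with '='."""
    hits = []
    for row_index, row in enumerate(rows):
        for column, value in row.items():
            if isinstance(value, str) and value.startswith("="):
                hits.append((row_index, column, value))
    return hits


# Made-up sample data: every value is text, but two look like formulas.
sample_rows = [
    {"id": "1", "note": "plain text"},
    {"id": "2", "note": "=== separator ==="},
    {"id": "3", "note": "=SUM(A1:A2)"},
]

hits = formula_like_cells(sample_rows)
for row_index, column, value in hits:
    print(row_index, column, value)
```

For a DataFrame, the same scan could run over `df.to_dict('records')`. If it reports hits, the rows where Excel stops could be compared against the first reported row index.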

