pandas-dev / pandas

Flexible and powerful data analysis / manipulation library for Python, providing labeled data structures similar to R data.frame objects, statistical functions, and much more
https://pandas.pydata.org
BSD 3-Clause "New" or "Revised" License
43.22k stars 17.78k forks source link

BUG: "with pd.ExcelWriter" produces a corrupt Excel file in case of .xlsm extension #44868

Open maximrobi opened 2 years ago

maximrobi commented 2 years ago

Reproducible Example

import pandas as pd

excel_path = r"D:\my_test.xlsm"
with pd.ExcelWriter(excel_path, mode='a') as writer:
    # df.to_excel(writer, sheet_name='Sheet1')
    pass

df_excel = pd.read_excel(excel_path, 'Sheet1')

Issue Description

The specified Excel file gets corrupted, despite the append mode. Changing its extension to .xlsx can retrieve the raw data from it, but the macro is forever gone.

In my example I used with and had only a pass statement inside it, yet the file became useless.

Excel_macro_corruption.zip

Q1: Can we know for sure that the default engine is used for .xlsm files? Q2: Should I contact OpenPyxl instead of you with this issue?

Expected Behavior

As I've read since my first encounter of this bug, ExcelWriter doesn't support macros, but a warning of some kind would be nice before wiping out a carefully written VBA code from the Excel file.

Installed Versions

INSTALLED VERSIONS ------------------ commit : 945c9ed766a61c7d2c0a7cbb251b6edebf9cb7d5 python : 3.7.4.final.0 python-bits : 64 OS : Windows OS-release : 10 Version : 10.0.19041 machine : AMD64 processor : Intel64 Family 6 Model 158 Stepping 9, GenuineIntel byteorder : little LC_ALL : None LANG : None LOCALE : None.None pandas : 1.3.4 numpy : 1.17.3 pytz : 2021.3 dateutil : 2.8.2 pip : 19.3.1 setuptools : 40.8.0 Cython : None pytest : None hypothesis : None sphinx : None blosc : None feather : None xlsxwriter : None lxml.etree : None html5lib : None pymysql : None psycopg2 : None jinja2 : None IPython : None pandas_datareader: None bs4 : None bottleneck : None fsspec : None fastparquet : None gcsfs : None matplotlib : None numexpr : None odfpy : None openpyxl : 3.0.9 pandas_gbq : None pyarrow : None pyxlsb : None s3fs : None scipy : None sqlalchemy : None tables : None tabulate : None xarray : None xlrd : None xlwt : None numba : None
twoertwein commented 2 years ago

Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?

I assume you still lose your macro with the non-empty with-block?

If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.

If the engine does not support some excel features, these features will probably get lost.

twoertwein commented 2 years ago

It would be good to emphasize that features not supported by the underlying engine (which I think is always openpyxl for mode="a") are lost in the mode section of ExcelWriter: image

maximrobi commented 2 years ago

Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?

I assume you still lose your macro with the non-empty with-block?

If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.

If the engine does not support some excel features, these features will probably get lost.

My main problem is, that after executing the code (which should do nothing..) my file looks intact with .xlsm extension, but I can't open it since it got corrupted somehow. I need to manually change the extension to .xlsx to be able to at least read the info in the cells - but this way the macro is well and truly lost. I will try to replicate the same behavior with openpyxl tomorrow and get back to you.

maximrobi commented 2 years ago

Does it really choose a default engine based on the mode? I read that it's based on the filetype/extension: image

Based on this, my .xlsm shouldn't even work - since it's not supported by any of the engines. I have no idea what happens when I execute the code. Can I debug somehow to see which engine is getting selected by default?

twoertwein commented 2 years ago

Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).

The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with block and check writer.engine

rhshadrach commented 2 years ago

Or just print:

with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
    print(writer.engine)
rhshadrach commented 2 years ago

@maximrobi - can you post a file example before the corruption occurs?

maximrobi commented 2 years ago

@maximrobi - can you post a file example before the corruption occurs?

@rhshadrach You have it in the zip file under Issue Description. The _mytest - backup.xlsm should work 100%, just change the name to what the script is expecting to find.

maximrobi commented 2 years ago

Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).

The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with block and check writer.engine

@twoertwein Thanks for the info! I managed to debug the with statement and I can confirm that it is using openpyxl for my .xlsm file.

rhshadrach commented 2 years ago

I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save is causing the issue. Upon the context manager exiting, we run

        self.book.save(self.handles.handle)
        if "r+" in self.mode and not isinstance(self.handles.handle, mmap.mmap):
            # truncate file to the written content
            self.handles.handle.truncate()

I haven't been able to discern what this truncate function is doing.

twoertwein commented 2 years ago

We always write the workbook back, even if no changes are made. That's how we lose features that are not supported by an engine.

I haven't been able to discern what this truncate function is doing.

The truncate shouldn't cause an issue here (it is actually needed). It is there in case the original file is larger than the new file (for example, if unsupported features are dropped by the engine). Otherwise, we would simply overwrite the first n-bits of the file and leave the remaining bits untouched, truncate removes those remaining bits.

maximrobi commented 2 years ago

we lose features that are not supported by an engine.

Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it? Maybe create a copy before manipulating it. I remember seeing some kind of warning for the Data Validation logic inside the same Excel document, but I couldn't replicate it since then.

maximrobi commented 2 years ago

I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save is causing the issue.

Would this mean that this is a Windows-related bug? What was your end-result after running my code: a non-corrupt Excel file with .xlsm extension? Did it keep the macro function intact inside it?

rhshadrach commented 2 years ago

@maximrobi - I got a valid Excel file without the macro.

Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it?

That is currently the behavior: we raise if the type is not able to be determined and the engine is not explicitly specified. The issue here is that pandas determines the file is an xlsx because it is a zip file with the content "xl/workbook.xml". Is the conclusion here incorrect?

rhshadrach commented 2 years ago

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

maximrobi commented 2 years ago

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

I tried this and it results in the same corrupt file :(

maximrobi commented 2 years ago

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

I'm not really familiar what the difference between .xlsx and .xlsm is on a lower level, but you are correct in that they both have workbooks. Theoretically if you open an .xlsm file az zip, you should find the macro inside the vbaproject.bin file.

AlejoB13C commented 2 years ago

The most I could do with this was to modify pandas.io.excel._openpyxl.py.OpenpyxlWriter.__init__.book (line 65) to directly add "keep_vba = True" since this line does not include any motor_kwargs. And from what I tried this doesn't affect .xlsx files

However to append data to existing sheet, you should do something like:

path = './file.xlsm'
sheet_name = 'sheet_name'
adf = pd.read_excel(path, sheet_name=sheet_name, engine='openpyxl')
wb = load_workbook(path, keep_vba=True)

with pd.ExcelWriter(path, engine = 'openpyxl') as writer:
    writer.book = wb
    writer.sheets = {n: wb[n] for n in wb.sheetnames}
    df.to_excel(writer, sheet_name = sheetname, index=False, header=False, startrow=len(adf)+1)

or modify again _openpyxl.py to add something like an "append" option to def write_cells()

ryanBeattie commented 2 years ago

@maximrobi I have had the same issue. I tried each solution in this thread and others found on stack overflow https://stackoverflow.com/questions/28169731/write-pandas-dataframe-to-xlsm-file-excel-with-macros-enabled. I have found a workaround using xlwings to save the file as a xlsm file after the data frame has been written to it.

` import os

import xlswriter import pandas as pd from xlwings import Book

df = pd.DataFrame(columns=['one', 'two'], data=[[1,2], [3,4]])

filename = 'TEST.xlsx' filename_macro = 'TEST.xlsm'

with pd.ExcelWriter(filename, engine='xlsxwriter') as writer: df.to_excel(writer) writer.save()

book = Book(filename) book.save(filename_macro) book.close()

os.remove(filename) os.system("TASKKILL /F /IM Excel.exe") `

The first part of the code is standard writing to an xlsx file in pandas. The second part is using the xlwings Book class constructor to open the xlsx file inside your directory and then saving it with the xlsm file extension. In order to remove the redundant xlsx file, using os module to delete it from the system.

There is a minor bug that when the Book() constructor is called and then closed, the excel application opens and doesnt terminate. It stays open as a minimised window, hence the os module is required again to terminate the process using the system function.

I am not very experienced and still at university so maybe my implementation is messy or unnecessarily complicated, but it worked for me after hours of headache so I hope it helps.

I also did not add in that I create the workbook excel file using xlsxwriter first. This may be redundant when using the pandas module however.

AlejoB13C commented 2 years ago

What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like

from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()

The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

tehunter commented 1 year ago

@maximrobi Is this resolved?

I just ran into the issue initially before setting the keep_vba parameter in engine_kwargs. Everything seems to work properly now (this is on Windows btw)

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi?

In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book with openpyxl.load_workbook in pre-1.5 to using the file copy + mode="a" approach in post-1.5 as discussed in #48780

maximrobi commented 1 year ago

What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like

from openpyxl import Workbook, load_workbook

path = './book.xlsm'
wb = Workbook()
wb.save(path)

wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()

The first part with Workbook() creates the file, but it is corrupted, however, when you open it with load_workbook and the keep_vba=True arg, the file is magically fixed and ready to be manipulated.

Wow, maybe I should try this at my end as well - although I already changed my workstation and maybe a newer version of Pandas already has a fix for my problem. I'll let you know if I manage to get any results!

maximrobi commented 1 year ago

@maximrobi Is this resolved?

I just ran into the issue initially before setting the keep_vba parameter in engine_kwargs. Everything seems to work properly now (this is on Windows btw)

I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba. Running this on the test file here preserves the macro:

with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
    pass

What happens when you run this on Windows @maximrobi? In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.

+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book with openpyxl.load_workbook in pre-1.5 to using the file copy + mode="a" approach in post-1.5 as discussed in #48780

Not sure what a resolution would be to this issue, maybe a confirmation from someone that a warning was implemented and nobody loses 2 days' worth of scripting in VBA due to a mini-automatization script in Python :)) [and I learned that you never run a script with original input of a file..]

So no, as far as I know, this issue is still active. I will try it once again on my new machine and latest pandas version to see how it behaves.

eldarmammadov commented 1 year ago

I wonder how issue has been resolved? I previously resolved same issue by installing pandas 1.4.4 version. Now, with v1.5.2 gives same corrupted excel file error Magic line in v1.5.2 with pd.ExcelWriter(new_path03xl, engine='openpyxl', mode='r+', if_sheet_exists='overlay',engine_kwargs={'keep_vba': True}) as writer: book = load_workbook(new_path03xl, keep_vba=True)

sistaninia commented 1 year ago

Do not save the writer, close it:

writer.close()

ejabu commented 12 months ago

Worked for me with below code

with pd.ExcelWriter(output_path) as writer:
        df2.to_excel(writer, sheet_name="v2", index=False)
        for sheet_name in ["v2"]:
            ws = writer.sheets[sheet_name]
            ws.column_dimensions['A'].width = 247 / 6
            ws.column_dimensions['B'].width = 78 / 6
            ws.column_dimensions['C'].width = 99 / 6
            ws.column_dimensions['D'].width = 208 / 6
            ws.column_dimensions['E'].width = 140 / 6
            ws.column_dimensions['F'].width = 200 / 6
            ws.column_dimensions['G'].width = 126 / 6
        # writer.close() # this line is the root cause for me
iiosenov commented 11 months ago

Passing the keep_vba param in combination with openpyxl indeed solved the problem.

with pd.ExcelWriter(path, engine = 'openpyxl', engine_kwargs = {'keep_vba': True}, mode = 'a', if_sheet_exists = 'replace') as writer:
    df.to_excel(writer, sheet_name='df')
rhshadrach commented 11 months ago

Thanks @ejabu and @iiosenov. I think the main issue has been solved by #52214 - namely users can now pass keep_vba=True if they desire. But this seems to result in a corrupted file if not passed.

I'm wondering what the negative side effects are of having keep_vba=True as the default.

maximrobi commented 7 months ago

Hi @rhshadrach, and thank you for the summary! Yes, the issue is solved with the latest releases available in PyCharm (checked on my new workstation with Python3.10, Pandas 2.1.4 and Openpyxl 3.1.2).

Since I also tested it with the original versions from this ticket and it still results in a corrupt Excel document, is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.

P.S: The ticket you linked mainly talks about .xls support - but in my case I only had problems due to the added macro (meaning .xlsm), everything worked okay on .xlsx documents. I double checked it right now - not sure if it makes a difference, but I felt a need to specify.

rhshadrach commented 7 months ago

is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.

pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.

maximrobi commented 7 months ago

pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.

Understood. Then I think we can close this issue - and as a solution I will either create a copy of my input documents before running ANYTHING on them, or just use the newer Pandas versions in the future. Thanks again!

rhshadrach commented 7 months ago

I still think we should explore having keep_vba default to True. I'd like to keep this issue open until that determination happens.