Open maximrobi opened 2 years ago
Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?
I assume you still lose your macro with the non-empty with-block?
If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.
If the engine does not support some excel features, these features will probably get lost.
It would be good to emphasize that features not supported by the underlying engine (which I think is always openpyxl for mode="a") are lost in the mode section of ExcelWriter:
Do you have the same issue when using openpyxl directly (using it to read the file and then write it again)?
I assume you still lose your macro with the non-empty with-block?
If I understand the pandas code: pandas will use the engine (openpyxl in this case) to read the excel workbook, a user can append data to it (you choose not to, that should be okay - hopefully), and when pandas closes the file it asks the engine to write the potentially changed workbook back to the file.
If the engine does not support some excel features, these features will probably get lost.
My main problem is, that after executing the code (which should do nothing..) my file looks intact with .xlsm extension, but I can't open it since it got corrupted somehow. I need to manually change the extension to .xlsx to be able to at least read the info in the cells - but this way the macro is well and truly lost. I will try to replicate the same behavior with openpyxl tomorrow and get back to you.
Does it really choose a default engine based on the mode? I read that it's based on the filetype/extension:
Based on this, my .xlsm shouldn't even work - since it's not supported by any of the engines. I have no idea what happens when I execute the code. Can I debug somehow to see which engine is getting selected by default?
Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).
The engine is inferred by extension and based on the content. You could add a breakpoint in your empty with
block and check writer.engine
Or just print:
with pd.ExcelWriter('test.xlsx', mode='a', engine='openpyxl') as writer:
print(writer.engine)
@maximrobi - can you post a file example before the corruption occurs?
@maximrobi - can you post a file example before the corruption occurs?
@rhshadrach You have it in the zip file under Issue Description. The _mytest - backup.xlsm should work 100%, just change the name to what the script is expecting to find.
Yes, you are right, it doesn't pick based on the mode (somehow I thought the only engine that supported it was openpyxl).
The engine is inferred by extension and based on the content. You could add a breakpoint in your empty
with
block and checkwriter.engine
@twoertwein Thanks for the info! I managed to debug the with statement and I can confirm that it is using openpyxl for my .xlsm file.
I can't duplicate on Linux. I wonder if the code in OpenpyxlWriter.save
is causing the issue. Upon the context manager exiting, we run
self.book.save(self.handles.handle)
if "r+" in self.mode and not isinstance(self.handles.handle, mmap.mmap):
# truncate file to the written content
self.handles.handle.truncate()
I haven't been able to discern what this truncate function is doing.
We always write the workbook back, even if no changes are made. That's how we lose features that are not supported by an engine.
I haven't been able to discern what this truncate function is doing.
The truncate shouldn't cause an issue here (it is actually needed). It is there in case the original file is larger than the new file (for example, if unsupported features are dropped by the engine). Otherwise, we would simply overwrite the first n-bits of the file and leave the remaining bits untouched, truncate removes those remaining bits.
we lose features that are not supported by an engine.
Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it? Maybe create a copy before manipulating it. I remember seeing some kind of warning for the Data Validation logic inside the same Excel document, but I couldn't replicate it since then.
I can't duplicate on Linux. I wonder if the code in
OpenpyxlWriter.save
is causing the issue.
Would this mean that this is a Windows-related bug? What was your end-result after running my code: a non-corrupt Excel file with .xlsm extension? Did it keep the macro function intact inside it?
@maximrobi - I got a valid Excel file without the macro.
Wouldn't it better to just pop an error (or at least a warning) that the filetype is not recognized and it'll remove some functionality from it?
That is currently the behavior: we raise if the type is not able to be determined and the engine is not explicitly specified. The issue here is that pandas determines the file is an xlsx because it is a zip file with the content "xl/workbook.xml". Is the conclusion here incorrect?
I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument keep_vba
. Running this on the test file here preserves the macro:
with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer:
pass
What happens when you run this on Windows @maximrobi?
In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.
I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument
keep_vba
. Running this on the test file here preserves the macro:with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer: pass
What happens when you run this on Windows @maximrobi?
In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.
I tried this and it results in the same corrupt file :(
I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument
keep_vba
. Running this on the test file here preserves the macro:with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer: pass
What happens when you run this on Windows @maximrobi?
In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.
I'm not really familiar what the difference between .xlsx and .xlsm is on a lower level, but you are correct in that they both have workbooks. Theoretically if you open an .xlsm file az zip, you should find the macro inside the vbaproject.bin file.
The most I could do with this was to modify pandas.io.excel._openpyxl.py.OpenpyxlWriter.__init__.book (line 65)
to directly add "keep_vba = True"
since this line does not include any motor_kwargs. And from what I tried this doesn't affect .xlsx files
However to append data to existing sheet, you should do something like:
path = './file.xlsm'
sheet_name = 'sheet_name'
adf = pd.read_excel(path, sheet_name=sheet_name, engine='openpyxl')
wb = load_workbook(path, keep_vba=True)
with pd.ExcelWriter(path, engine = 'openpyxl') as writer:
writer.book = wb
writer.sheets = {n: wb[n] for n in wb.sheetnames}
df.to_excel(writer, sheet_name = sheetname, index=False, header=False, startrow=len(adf)+1)
or modify again _openpyxl.py
to add something like an "append"
option to def write_cells()
@maximrobi I have had the same issue. I tried each solution in this thread and others found on stack overflow https://stackoverflow.com/questions/28169731/write-pandas-dataframe-to-xlsm-file-excel-with-macros-enabled. I have found a workaround using xlwings to save the file as a xlsm file after the data frame has been written to it.
` import os
import xlswriter import pandas as pd from xlwings import Book
df = pd.DataFrame(columns=['one', 'two'], data=[[1,2], [3,4]])
filename = 'TEST.xlsx' filename_macro = 'TEST.xlsm'
with pd.ExcelWriter(filename, engine='xlsxwriter') as writer: df.to_excel(writer) writer.save()
book = Book(filename) book.save(filename_macro) book.close()
os.remove(filename) os.system("TASKKILL /F /IM Excel.exe") `
The first part of the code is standard writing to an xlsx file in pandas. The second part is using the xlwings Book class constructor to open the xlsx file inside your directory and then saving it with the xlsm file extension. In order to remove the redundant xlsx file, using os module to delete it from the system.
There is a minor bug that when the Book() constructor is called and then closed, the excel application opens and doesnt terminate. It stays open as a minimised window, hence the os module is required again to terminate the process using the system function.
I am not very experienced and still at university so maybe my implementation is messy or unnecessarily complicated, but it worked for me after hours of headache so I hope it helps.
I also did not add in that I create the workbook excel file using xlsxwriter first. This may be redundant when using the pandas module however.
What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like
from openpyxl import Workbook, load_workbook
path = './book.xlsm'
wb = Workbook()
wb.save(path)
wb = load_workbook(path, keep_vba=True)
wb.save(path)
wb.close()
The first part with Workbook()
creates the file, but it is corrupted, however, when you open it with load_workbook
and the keep_vba=True
arg, the file is magically fixed and ready to be manipulated.
@maximrobi Is this resolved?
I just ran into the issue initially before setting the keep_vba
parameter in engine_kwargs
. Everything seems to work properly now (this is on Windows btw)
I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument
keep_vba
. Running this on the test file here preserves the macro:with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer: pass
What happens when you run this on Windows @maximrobi?
In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.
+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the book
with openpyxl.load_workbook
in pre-1.5 to using the file copy + mode="a"
approach in post-1.5 as discussed in #48780
What I do to create a valid .xlsm file directly in python is also with openpyxl I do something like
from openpyxl import Workbook, load_workbook path = './book.xlsm' wb = Workbook() wb.save(path) wb = load_workbook(path, keep_vba=True) wb.save(path) wb.close()
The first part with
Workbook()
creates the file, but it is corrupted, however, when you open it withload_workbook
and thekeep_vba=True
arg, the file is magically fixed and ready to be manipulated.
Wow, maybe I should try this at my end as well - although I already changed my workstation and maybe a newer version of Pandas already has a fix for my problem. I'll let you know if I manage to get any results!
@maximrobi Is this resolved?
I just ran into the issue initially before setting the
keep_vba
parameter inengine_kwargs
. Everything seems to work properly now (this is on Windows btw)I think I understand now. xlsm / xlsb files can contain macros, whereas xlsx cannot. Openpyxl allows the engine argument
keep_vba
. Running this on the test file here preserves the macro:with pd.ExcelWriter(excel_path, mode='a', engine_kwargs={'keep_vba': True}) as writer: pass
What happens when you run this on Windows @maximrobi? In order to have a warning, we'd need to inspect the file as well as the engine_kwargs to determine if macros will be lost, warning when they are. +1 on adding such a warning.
+1 to issuing a warning as well. Without a warning, this could be an issue as users convert from directly setting the
book
withopenpyxl.load_workbook
in pre-1.5 to using the file copy +mode="a"
approach in post-1.5 as discussed in #48780
Not sure what a resolution would be to this issue, maybe a confirmation from someone that a warning was implemented and nobody loses 2 days' worth of scripting in VBA due to a mini-automatization script in Python :)) [and I learned that you never run a script with original input of a file..]
So no, as far as I know, this issue is still active. I will try it once again on my new machine and latest pandas version to see how it behaves.
I wonder how issue has been resolved? I previously resolved same issue by installing pandas 1.4.4 version. Now, with v1.5.2 gives same corrupted excel file error Magic line in v1.5.2 with pd.ExcelWriter(new_path03xl, engine='openpyxl', mode='r+', if_sheet_exists='overlay',engine_kwargs={'keep_vba': True}) as writer: book = load_workbook(new_path03xl, keep_vba=True)
Do not save the writer, close it:
writer.close()
Worked for me with below code
with pd.ExcelWriter(output_path) as writer:
df2.to_excel(writer, sheet_name="v2", index=False)
for sheet_name in ["v2"]:
ws = writer.sheets[sheet_name]
ws.column_dimensions['A'].width = 247 / 6
ws.column_dimensions['B'].width = 78 / 6
ws.column_dimensions['C'].width = 99 / 6
ws.column_dimensions['D'].width = 208 / 6
ws.column_dimensions['E'].width = 140 / 6
ws.column_dimensions['F'].width = 200 / 6
ws.column_dimensions['G'].width = 126 / 6
# writer.close() # this line is the root cause for me
Passing the keep_vba param in combination with openpyxl indeed solved the problem.
with pd.ExcelWriter(path, engine = 'openpyxl', engine_kwargs = {'keep_vba': True}, mode = 'a', if_sheet_exists = 'replace') as writer:
df.to_excel(writer, sheet_name='df')
Thanks @ejabu and @iiosenov. I think the main issue has been solved by #52214 - namely users can now pass keep_vba=True
if they desire. But this seems to result in a corrupted file if not passed.
I'm wondering what the negative side effects are of having keep_vba=True
as the default.
Hi @rhshadrach, and thank you for the summary! Yes, the issue is solved with the latest releases available in PyCharm (checked on my new workstation with Python3.10, Pandas 2.1.4 and Openpyxl 3.1.2).
Since I also tested it with the original versions from this ticket and it still results in a corrupt Excel document, is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.
P.S: The ticket you linked mainly talks about .xls support - but in my case I only had problems due to the added macro (meaning .xlsm), everything worked okay on .xlsx documents. I double checked it right now - not sure if it makes a difference, but I felt a need to specify.
is there any way we can backtrack and at least print a warning in the console? I know a lot of production code can't upgrade to newest Python/Pandas, so we could save them from losing valuable data due to this bug.
pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.
pandas typically only releases new patches for the previous minor version (2.1 right now). We do not have the resources to support older versions in general. An issue has to be truly exceptional in order for us to consider patching older versions. I do not think this issue meets that bar.
Understood. Then I think we can close this issue - and as a solution I will either create a copy of my input documents before running ANYTHING on them, or just use the newer Pandas versions in the future. Thanks again!
I still think we should explore having keep_vba
default to True. I'd like to keep this issue open until that determination happens.
[X] I have checked that this issue has not already been reported.
[X] I have confirmed this bug exists on the latest version of pandas.
[ ] I have confirmed this bug exists on the master branch of pandas.
Reproducible Example
Issue Description
The specified Excel file gets corrupted, despite the append mode. Changing its extension to .xlsx can retrieve the raw data from it, but the macro is forever gone.
In my example I used with and had only a pass statement inside it, yet the file became useless.
Excel_macro_corruption.zip
Q1: Can we know for sure that the default engine is used for .xlsm files? Q2: Should I contact OpenPyxl instead of you with this issue?
Expected Behavior
As I've read since my first encounter of this bug, ExcelWriter doesn't support macros, but a warning of some kind would be nice before wiping out a carefully written VBA code from the Excel file.
Installed Versions