kz26 / PyExcelerate

Accelerated Excel XLSX Writing Library for Python 2/3
https://pypi.org/project/PyExcelerate/
BSD 2-Clause "Simplified" License

Large dataframes containing strings generate invalid xlsx files #167

Open EdGaere opened 2 years ago

EdGaere commented 2 years ago

For large dataframes containing strings with 500k rows and 60+ columns, pyexcelerate generates invalid .xlsx files that cannot be opened in Excel. Excel gives the error message: "We found a problem with some content in yourfile.xlsx. Do you want us to try to recover as much as we can?". Answering yes removes all invalid content, which yields an empty worksheet.

from string import ascii_lowercase
from numpy import random  # random.choice(..., size=...) below is NumPy's API
from pandas import DataFrame
from pyexcelerate import Workbook
from uuid import uuid4

num_rows = 501000
num_cols = 64

# generate dataframe with column names
column_names = [f"Col{idx}" for idx in range(num_cols)]
df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)), columns=column_names)

# write to file
output_filename = f"/tmp/{str(uuid4())}.xlsx"

wb = Workbook()
ws = wb.new_sheet("sheet name", data=df.values.tolist())
wb.save(output_filename)

The xlsx file is successfully written to disk and is 138 MB in size, but produces the above-mentioned error when opened in Excel. Further inspection in Python reveals that the ZIP archive itself is invalid.
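A quick way to confirm the archive itself is broken, using only the standard library (a sketch reusing output_filename from the snippet above):

import zipfile

# An .xlsx file is a ZIP archive, so both checks should pass for a valid file.
print(zipfile.is_zipfile(output_filename))  # False suggests a truncated/corrupt archive
try:
    with zipfile.ZipFile(output_filename) as zf:
        print(zf.testzip())                 # name of the first corrupt member, or None
except zipfile.BadZipFile as exc:
    print("not a valid zip:", exc)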

Interestingly, the problem only occurs for large dataframes filled with strings. The problem does not occur:

Versions:

kevmo314 commented 2 years ago

Have you tried splitting across multiple worksheets? While technically allowed by the specification, Excel's behavior above 16384 columns is not guaranteed: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
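For reference, a minimal sketch of the splitting idea, chunking by rows and reusing df and output_filename from the report above (the chunk size is an arbitrary choice, not anything the library prescribes):

from pyexcelerate import Workbook

# Sketch: one sheet per row chunk, so no single sheet approaches Excel's limits.
chunk_size = 250_000  # arbitrary; anything under Excel's 1,048,576-row cap
rows = df.values.tolist()

wb = Workbook()
for i, start in enumerate(range(0, len(rows), chunk_size)):
    wb.new_sheet("sheet%d" % i, data=rows[start:start + chunk_size])
wb.save(output_filename)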

EdGaere commented 2 years ago

Hey Kevin,

Thanks for the quick response! I am only writing 64 columns... so that is way below Excel's 16K limit.

In terms of rows, the problem occurs at around 500K rows, which is also below Excel's 1M limit.

Edward

kevmo314 commented 2 years ago

Ah, apologies, I misread the message. That's a bit odd; I wonder if worksheets support having that many inline strings. The relevant line is here: https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Writer.py#L78

If the zip file being generated is invalid, then we can probably open it up and see what's being generated; hopefully something sticks out as incorrect. If you can take a look, a contribution would be super welcome, otherwise I'll try to take a look sometime soon.
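A sketch of that inspection, again with only the standard library and output_filename from the original snippet (member names follow the xl/worksheets/sheetN.xml layout referenced above):

import zipfile

# List every member with its sizes, then peek at the worksheet XML.
# If the archive is too broken, ZipFile() itself raises BadZipFile.
with zipfile.ZipFile(output_filename) as zf:
    for info in zf.infolist():
        print(info.filename, info.file_size, info.compress_size)
    with zf.open("xl/worksheets/sheet1.xml") as f:
        print(f.read(500))  # first 500 bytes of the generated XML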

EdGaere commented 2 years ago

When writing the same dataframe with Pandas/openpyxl, there are no problems with the generated file. So it appears Excel can support the volume of inline strings.

Of course, openpyxl takes 3x longer and consumes 10x more memory.

kevmo314 commented 2 years ago

If I recall correctly, openpyxl creates a table of strings rather than inlining them, so instead of writing 500000*60 strings, openpyxl will only write 60. PyExcelerate relies on the deflate algorithm to reduce the file size; we profiled the string-table approach and maintaining that table took a lot longer than it was worth. So the two aren't directly comparable. I suspect the issue is probably somewhere in the produced worksheet XML.
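For context, the two encodings look roughly like this in the underlying XML (hand-written, simplified fragments, not output captured from either library):

# Inline string (PyExcelerate): the text is embedded in every cell, and the
# deflate step is what keeps repeated values from bloating the file.
inline_cell = '<c r="A1" t="inlineStr"><is><t>hello</t></is></c>'

# Shared string (openpyxl): the cell stores an index into a workbook-wide
# sharedStrings.xml table, so each distinct string is written only once.
shared_cell = '<c r="A1" t="s"><v>0</v></c>'  # 0 = first table entry
shared_table = '<sst count="1" uniqueCount="1"><si><t>hello</t></si></sst>'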

EdGaere commented 2 years ago

Understood. I just realised that in the example I provided there are only 26 unique string values (string.ascii_lowercase) across all 500,000 × 64 cells.

df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)), columns=column_names)
bsam-parsionate commented 2 years ago

Is there a fix or workaround for this yet? I am facing the same issue. In my case, I have around 1.4M rows, so I am splitting them into two sheets: the first 750K rows in one sheet and the rest in the second. But it generates an invalid Excel file, and when MS Excel tries to restore the data, it shows both sheets completely empty.


jonathancyu commented 2 years ago

I also have a large dataset that is generating invalid xlsx files; however, I'm able to open and repair the files. Excel lists the repaired records as: Repaired Records: String properties from /xl/worksheets/sheet1.xml part

ejhorow commented 2 years ago

I am having a similar problem. I am generating a very large file with about 8M rows and about 10 columns containing a mix of short text strings (~10 chars) and small integers. The resulting file is about 430 MB. Trying to read the file back with openpyxl gives:

zipfile.BadZipFile: File is not a zip file

Python 3.9.9, pyexcelerate 0.10.0, openpyxl 3.0.10

kevmo314 commented 2 years ago

> Is there a fix or workaround for this yet? I am facing the same issue. In my case, I have around 1.4M rows, so I am splitting them into two sheets: the first 750K rows in one sheet and the rest in the second. But it generates an invalid Excel file, and when MS Excel tries to restore the data, it shows both sheets completely empty.

This seems like an issue with the XML generation. I'm happy to review a pull request, and it seems that a fix would be welcome as it affects quite a few people.

jonathancyu commented 2 years ago

Changing line 78 of Writer.py to with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f: allows those XML files to be written. Unfortunately, opening the resulting file gives the error "Microsoft Excel is waiting for another application to complete an OLE action". You can get around this by killing and reopening Excel, opening a blank sheet, and navigating to the file from the Document Recovery pane.

kevmo314 commented 2 years ago

I created a PR to add some logging and the force_zip64 change above: https://github.com/kz26/PyExcelerate/pull/178

If you're running into this issue, give that PR a shot and see if there's any additional debug information that gets produced. I suspect an error is being swallowed.

Nishad290 commented 2 years ago

This change works for me:

with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

And I can open the file in Excel as well.

The issue seems to be that Python's ZipFile module assumes a 2 GiB limit per file by default. From the documentation:

When writing a file, if the file size is not known in advance but may exceed 2 GiB, pass force_zip64=True to ensure that the header format is capable of supporting large files. If the file size is known in advance, construct a ZipInfo object with file_size set, and use that as the name parameter.

Meanwhile, without force_zip64=True I get this error: file size unexpectedly exceeded zip64 limit
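The same failure can be reproduced outside PyExcelerate. A self-contained sketch (it deliberately streams a little over 2 GiB into one member, so only run it with disk space to spare):

import zipfile

# Without force_zip64=True, closing the member raises
# RuntimeError: File size unexpectedly exceeded ZIP64 limit
# once more than 2 GiB has been written. With it, the write succeeds.
chunk = b"x" * (64 * 1024 * 1024)  # 64 MiB per write
with zipfile.ZipFile("/tmp/big.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    with zf.open("member.xml", mode="w", force_zip64=True) as f:
        for _ in range(33):  # 33 * 64 MiB ≈ 2.06 GiB uncompressed
            f.write(chunk)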

stgcorpbr commented 2 years ago

I solved it by doing this: with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:

After the file is created, I rename it to .zip and then repair it with this command on Linux:

zip -F myfileexcel.zip --out myfileexcel.xlsx

It's not best practice, but I believe someone will improve this.

SamiranTechPro commented 8 months ago

I have been facing the same problem. Reviewing the dataset, I found that this message comes up if any column value starts with = or -. I tried escaping the = and -, but in my scenario that doesn't work.
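If formula-looking values really are the trigger, one possible workaround is to sanitize the data before handing it to PyExcelerate. A sketch, where the prefix character and the set of risky leading characters are assumptions rather than anything the library prescribes:

# Sketch: coerce values that Excel might parse as formulas into plain text.
# The apostrophe prefix mirrors Excel's "treat as text" convention; whether
# this resolves the invalid-file symptom above is unverified.
RISKY_PREFIXES = ("=", "-", "+", "@")

def sanitize(value):
    if isinstance(value, str) and value.startswith(RISKY_PREFIXES):
        return "'" + value
    return value

safe_rows = [[sanitize(v) for v in row] for row in df.values.tolist()]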