EdGaere opened this issue 2 years ago
Have you tried splitting across multiple worksheets? While technically allowed by the specification, Excel's behavior above 16384 columns is not guaranteed: https://support.microsoft.com/en-us/office/excel-specifications-and-limits-1672b34d-7043-467e-8e27-269d656771c3
Hey Kevin,
Thanks for the quick response! I am only writing 64 columns... so that is way below Excel's 16K limit.
In terms of rows, the problem occurs at around 500K rows, which is also below Excel's 1M limit.
Edward
Ah apologies, I misread the message. That's a bit odd, I wonder if the worksheets support having that many inline strings. The relevant line is here: https://github.com/kz26/PyExcelerate/blob/dev/pyexcelerate/Writer.py#L78
If the zip file being generated is invalid, then we can probably open it up and see what's being generated; hopefully something sticks out as incorrect. If you can take a look, a contribution would be super welcome, otherwise I'll try to take a look sometime soon.
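For anyone who wants to inspect a broken file, here is a minimal sketch (output.xlsx is a placeholder path, not something PyExcelerate produces): an .xlsx file is just a zip container, so Python's zipfile module can check it directly.

import zipfile

path = "output.xlsx"  # placeholder: point this at the generated file

if not zipfile.is_zipfile(path):
    print("Not a readable zip archive")
else:
    with zipfile.ZipFile(path) as zf:
        # A well-formed workbook contains xl/workbook.xml plus one
        # xl/worksheets/sheetN.xml per sheet.
        for info in zf.infolist():
            print(info.filename, info.file_size)
        # testzip() returns the name of the first corrupt member, or None.
        print("First bad member:", zf.testzip())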
When writing the same dataframe with Pandas/openpyxl, there are no problems with the generated file. So it appears Excel can support the volume of inline strings.
Of course, openpyxl takes 3x longer and consumes 10x more memory.
If I recall correctly, openpyxl creates a table of strings rather than inlining them. Therefore, instead of writing 500000*60 strings, openpyxl will only write the 60 unique ones. PyExcelerate relies on the deflate algorithm to reduce the file size; we profiled the string-table approach and maintaining that table took a lot longer than it was worth. So the two aren't directly comparable. I suspect the issue is probably somewhere in the produced worksheet xml.
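For context, a simplified sketch of the two SpreadsheetML styles (trimmed fragments, not the exact markup either library emits):

# Inline string: every cell carries its own copy of the text.
# This is the style PyExcelerate writes.
inline_cell = '<c t="inlineStr"><is><t>hello</t></is></c>'

# Shared string: the cell stores an index into a single
# xl/sharedStrings.xml table, so each unique string is
# serialized only once. This is the openpyxl approach.
shared_cell = '<c t="s"><v>0</v></c>'
shared_table = '<sst><si><t>hello</t></si></sst>'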
Understood. I just realised that in the example I provided there are only 26 unique string values (string.ascii_lowercase) across all 500,000 * 64 cells.
df = DataFrame(random.choice(list(ascii_lowercase), size=(num_rows, num_cols)))
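For reference, a self-contained version of that reproduction could look like this (the sheet and file names are mine; the row and column counts match the ones discussed above):

from string import ascii_lowercase

import numpy as np
import pandas as pd
from pyexcelerate import Workbook

num_rows, num_cols = 500_000, 64

# Only 26 unique single-character strings across all cells.
df = pd.DataFrame(np.random.choice(list(ascii_lowercase),
                                   size=(num_rows, num_cols)))

wb = Workbook()
wb.new_sheet("data", data=df.values.tolist())
wb.save("large_strings.xlsx")  # at this scale the archive comes out invalid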
Is there a fix or workaround for this yet? I am also facing the same issue. In my case, I have around 1.4M rows, so I am splitting them in two: the first 750K rows in one sheet and the rest in a second sheet. But it generates an invalid Excel file, and when MS Excel tries to recover the data, both sheets show up completely empty.
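A sketch of that splitting approach (the helper and sheet names are mine), which as described still yields an invalid file:

from pyexcelerate import Workbook

def save_split(rows, path, chunk=750_000):
    # Write `rows` (a list of row lists) across multiple sheets,
    # `chunk` rows per sheet, to stay under Excel's ~1M row limit.
    wb = Workbook()
    for i, start in enumerate(range(0, len(rows), chunk), start=1):
        wb.new_sheet("Sheet%d" % i, data=rows[start:start + chunk])
    wb.save(path)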
I also have a large dataset that is generating invalid xlsx files; however, I'm able to open and repair the files. Excel lists the repaired records as:
Repaired Records: String properties from /xl/worksheets/sheet1.xml part
I am having a similar problem. I am generating a very large file with about 8M rows and about 10 columns, with a mix of short text strings (~10 chars) and small integers. The resulting file is about 430 MB. Trying to read the file using openpyxl gives:
zipfile.BadZipFile: File is not a zip file
Python 3.9.9, pyexcelerate 0.10.0, openpyxl 3.0.10
This seems like an issue with the xml generation. I'm happy to review a pull request; a fix would be welcome, as this affects quite a few people.
Changing line 78 of Writer.py to
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:
allows those xml files to be written. Unfortunately, opening the resulting file gives the error "Microsoft Excel is waiting for another application to complete an OLE action". You can get around this by killing and reopening Excel, opening a blank sheet, and navigating to the file from the Document Recovery pane.
I created a PR to add some logging and the force_zip64 change above: https://github.com/kz26/PyExcelerate/pull/178
If you're running into this issue, give that PR a shot and see if there's any additional debug information that gets produced. I suspect an error is being swallowed.
This change works for me:
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:
And I can open the file in Excel as well.
The issue seems to be that Python's zipfile library assumes a 2 GiB limit when the file size isn't known in advance. From the documentation:
When writing a file, if the file size is not known in advance but may exceed 2 GiB, pass force_zip64=True to ensure that the header format is capable of supporting large files. If the file size is known in advance, construct a ZipInfo object with file_size set, and use that as the name parameter.
Meanwhile, without force_zip64=True I get this error: file size unexpectedly exceeded ZIP64 limit
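A standalone demonstration of that zipfile behavior, independent of PyExcelerate (the file names are arbitrary; running this writes ~3 GiB to disk):

import zipfile

with zipfile.ZipFile("big.zip", "w", zipfile.ZIP_DEFLATED) as zf:
    # Streaming more than 2 GiB into a member whose size is not known
    # in advance raises the RuntimeError quoted above unless
    # force_zip64=True is passed.
    with zf.open("payload.bin", mode="w", force_zip64=True) as f:
        for _ in range(3):
            f.write(b"\0" * (1024 ** 3))  # 3 GiB in total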
I solved it by doing this:
with zf.open("xl/worksheets/sheet%s.xml" % (index), mode="w", force_zip64=True) as f:
After the file is created, I rename it to .zip and then repair it with this command on Linux:
zip -F myfileexcel.zip --out myfileexcel.xlsx
It's not best practice, but I believe someone will improve on this.
I have been facing the same problem. Reviewing the dataset, I found that if any column value starts with = or -, this message comes up. I tried escaping the = and -, but in my scenario that doesn't work.
For large dataframes containing strings with 500k rows and 60+ columns, pyexcelerate generates invalid .xlsx files that cannot be opened in Excel. Excel gives the error message: "We found a problem with some content in yourfile.xlsx. Do you want us to try to recover as much as we can?". Answering yes removes all invalid content, which yields an empty worksheet.
The xlsx file is successfully written to disk and is 138 MB in size, but generates the above-mentioned error when opened in Excel. Further inspection of the file in Python reveals that the zip archive is invalid.
Interestingly, the problem only occurs for large dataframes filled with strings. The problem does not occur:
Versions: