qrilka / xlsx

Simple and incomplete Excel file parser/writer
MIT License
132 stars 64 forks source link

Cannot read xlsx produce from apache POI #179

Closed dten closed 2 months ago

dten commented 2 months ago

POI Always writes zip format 4.5 files which zip-archive cannot read properly

https://github.com/apache/poi/blob/trunk/poi-ooxml/src/main/java/org/apache/poi/xssf/streaming/Zip64Impl.java#L88

technically this will also mean it cannot read any xlsx that requires zip64, so, very large (> 4GB) xslx

edit i see a comment in the code about zipOpt64 🤔 but that only fixes writing

qrilka commented 2 months ago

I guess that could be a ticket for zip-archive?

dten commented 2 months ago

yea it feels like it though zip-archive specifically says it does not support zip64 so I wasn't expecting them to entertain it

dten commented 2 months ago

the only haskell package I can see that entertains zip64 properly seems to be zip but it really wants to use the disk and never memory for its work (for performance reasons), which would be a very significant change for xlsx.

I'll go try my best with zip-archive 🫡 thanks

qrilka commented 2 months ago

@dten please let me know about how it goes, if it's problematic then I suppose it's worth to document this as a current implementation of the library

dten commented 2 months ago

So after much searching I've kind of decided that POI's outputs are weird. From version 5 they decided to always write out zip64 markers even when they're not needed. Excel will remove them if you open a file and save it again. They seem to have done it to use a new streaming output interface but the trade off seems to be most zip libraries complain about the archive. Windows thinks the archive is corrupt if you try modify it as a zip, unix zip complains the version markers are incorrect.

we ended up finding when using poi we can override that default and then it produces a file that zip-archive can read

workbook.setZip64Mode(Zip64Mode.AsNeeded);

this is enough for our case

they've even said it's the reader's fault previously

https://bz.apache.org/bugzilla/show_bug.cgi?id=66431