python / cpython

The Python programming language
https://www.python.org

zipfile detection of when to write a zip64 header should be made accurate #113931

Open gpshead opened 9 months ago

gpshead commented 9 months ago

Bug report

Proposal:

Today our zipfile module internal implementation uses a heuristic dance to determine when a zip64 header is likely to be required between zipfile.ZipFile._open_to_write() and zipfile._ZipWriteFile.close().

This seems rather silly. By the time zipfile._ZipWriteFile.close() is called, we know the real uncompressed and compressed data sizes and can deterministically decide then, instead of relying on the existing heuristic of "if the expected input file_size * 1.05 > ZIP64_LIMIT" used within _open_to_write() today.
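
To make the difference concrete, here is a minimal sketch contrasting the two decisions. The helper names `heuristic_needs_zip64` and `exact_needs_zip64` are hypothetical and are not part of zipfile; the constant is defined locally and is assumed to mirror `zipfile.ZIP64_LIMIT` in CPython.

```python
# Assumed to mirror the value of zipfile.ZIP64_LIMIT in CPython.
ZIP64_LIMIT = (1 << 31) - 1


def heuristic_needs_zip64(expected_file_size: int) -> bool:
    """Guess made up front, before any data is written (1.05 padding)."""
    return expected_file_size * 1.05 > ZIP64_LIMIT


def exact_needs_zip64(file_size: int, compress_size: int) -> bool:
    """Decision based on the real sizes, known once writing has finished."""
    return file_size > ZIP64_LIMIT or compress_size > ZIP64_LIMIT


# A borderline member: just under the limit, and it compresses well.
borderline = ZIP64_LIMIT - 1024
print(heuristic_needs_zip64(borderline))               # True: zip64 is forced
print(exact_needs_zip64(borderline, borderline // 2))  # False: not actually needed
```

The borderline case is the one mentioned above: the heuristic asks for zip64 even though neither final size exceeds the limit.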

The only time we should ever raise an exception regarding zip64 being required is if the API user has explicitly forbidden zip64's use.

I wouldn't backport this change to a stable release, as it will alter the exact output produced in some circumstances (zip64 headers will no longer be added unnecessarily in borderline cases where they are not needed). It is fair to consider it as much a bug fix that removes an odd internal implementation wart as a feature.

Has this already been discussed elsewhere?

This is a minor feature, which does not need previous discussion elsewhere

Links to previous discussion of this feature:

No response

gpshead commented 9 months ago

Amending: This is not always true as implemented. _open_to_write() needs to write the initial inline zip file header before a _ZipWriteFile is created to fill in the data, and the size of that header depends on whether zip64 is used. _ZipWriteFile.close() merely seeks back and updates the same header to fill in the CRC and sizes.
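
The header-size dependency can be observed from the outside. The following is a small sketch, not a description of zipfile internals: the helper `local_header_extra_len` is hypothetical, and it relies only on the public `force_zip64` argument of `ZipFile.open()` and on the standard local file header layout, where the 2-byte extra-field length sits at offset 28.

```python
import io
import struct
import zipfile


def local_header_extra_len(force_zip64: bool) -> int:
    """Write a one-member archive in memory and return the extra-field
    length recorded in that member's local file header."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        with zf.open("hello.txt", "w", force_zip64=force_zip64) as member:
            member.write(b"hello")
    raw = buf.getvalue()
    # The first member's local header starts at offset 0 of the archive.
    # Its extra-field length follows the file-name length, at offset 28.
    (extra_len,) = struct.unpack_from("<H", raw, 28)
    return extra_len


print(local_header_extra_len(False))
print(local_header_extra_len(True))
```

On CPython the second call should report a larger extra field, since the zip64 extra record carries a 4-byte record header plus two 8-byte sizes. Because the data follows the header immediately, the choice has to be made before the first byte of data is written.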

To get out of the heuristic business we either need to:

A. always write a zip64 header
B. handle the rare boundary condition when the compressed data winds up larger than uncompressed specially by rewriting things in that situation
C. give up on the zip format shenanigans, keep our heuristic, and go shopping.
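
The boundary condition where compressed output exceeds the input is easy to reproduce on a small scale. This sketch uses raw deflate via zlib on random bytes as a stand-in for incompressible input; the helper name `raw_deflate` is hypothetical, and the example only illustrates the size growth, not how zipfile would handle it.

```python
import os
import zlib


def raw_deflate(data: bytes) -> bytes:
    """Compress with a raw deflate stream (negative wbits, no zlib wrapper)."""
    compressor = zlib.compressobj(zlib.Z_DEFAULT_COMPRESSION, zlib.DEFLATED, -15)
    return compressor.compress(data) + compressor.flush()


# Random bytes are effectively incompressible, so deflate can only add
# framing overhead on top of the original payload.
payload = os.urandom(64 * 1024)
compressed = raw_deflate(payload)
print(len(payload), len(compressed))
```

A member whose uncompressed size sits just under the limit could therefore cross it only after compression, which is the situation a size check made up front cannot see.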

gpshead commented 9 months ago

Investigating which approaches other zip creation tools use would be more informative than reinventing the wheel here.

serhiy-storchaka commented 9 months ago

Always writing a zip64 header is inefficient for small files.