Closed c71cc42f-ce78-43ef-be91-7ae48fe4e59b closed 7 years ago
In Python 2.7.13, using zipfile.ZipFile to write into a file with some initial preamble produces a zip file that cannot be read again by some zip implementations. Our use case is using pex (https://github.com/pantsbuild/pex) which writes a zip that begins with a shebang, and later attempting to manipulate that using Go's standard archive/zip package. In 2.7.12 that works OK, but in 2.7.13 the .pex file is rejected on reading. Linux's command-line unzip tool will read the archive, but issues a warning ("4 extra bytes at beginning or within zipfile") which wasn't present previously. zipfile.ZipFile does read the files OK.
I assume this is related to https://bugs.python.org/issue26293 since that's the most obvious zipfile change in 2.7.13. It's pretty easy to reproduce using the example in that issue:
from zipfile import ZipFile
with open('a.zip', 'wb') as base:
base.write(b'old\n')
with ZipFile(base, 'a') as myzip:
myzip.write('eggs.txt')
unzip -t a.zip Archive: a.zip warning [a.zip]: 4 extra bytes at beginning or within zipfile (attempting to process anyway) ...
The specification of ZIP files is not pretty clear, but the unzip utility first try to interpret offsets relatively to the start of the archive that can be different from the start of the file. Go's standard archive/zip package interprets offsets as absolute file positions and doesn't support concatenated ZIP files.
In any case there is a regression. Proposed patch partially reverts bpo-26293. Offsets in created ZIP file are now absolute when create with modes 'w' and 'x'. But they are relative when create with mode 'a'. This should fix a regression in pex, but keep the case of bpo-26293.
I'm going to push the patch before tagging 3.5.3 rc1.
Is there a reason to ever not use relative offsets? It seems that's strictly more general the absolute.
Actually all offsets are relative to some point, just this point not always at the start of the file. If concatenate the ZIP file to other file, the unzip utility and the zipfile module are able to infer the starting point and correct offsets. Thus there is no matter what is the starting point of offsets. But Go's standard archive/zip package works only if offsets are relative to the start of the file. I don't know how common such interpretation however.
New changeset f5aa1c9c2b7e by Serhiy Storchaka in branch '3.5': Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes https://hg.python.org/cpython/rev/f5aa1c9c2b7e
New changeset 342bc734f523 by Serhiy Storchaka in branch '2.7': Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes https://hg.python.org/cpython/rev/342bc734f523
New changeset f36f9bce997d by Serhiy Storchaka in branch '3.6': Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes https://hg.python.org/cpython/rev/f36f9bce997d
If this is fixed, can we close this issue? This release blocker is one of two issues blocking 3.5.3 rc1.
New changeset a80c14ace927 by Serhiy Storchaka in branch 'default': Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes https://hg.python.org/cpython/rev/a80c14ace927
For the record, this new behaviour is wrong. It's not immediately obvious from the ZIP spec, but offsets have to be from the start of the file, not the archive, or ZIP64 can't work. And yes, that means you can't blindly concatenate ZIP files to other files, you have to rewrite them.
The way to create an 'embedded' ZIP archive like requested in issue bpo-26293 is to write it to a separate file (or file-like object). Making "a" have this magical and wrong-for-almost-anyone meaning is... confusing, to say the least. The change certainly doesn't belong in a bugfix release, but I guess it's too late for that now.
It's also not documented; the docs mention this:
If mode is 'a' and file refers to an existing ZIP file, then additional files are added to it. If file does not refer to a ZIP file, then a new ZIP archive is appended to the file. This is meant for adding a ZIP archive to another file (such as python.exe).
which is ambiguous at best and certainly suggests the resulting file would be usable as-is, which is not the case if you use mode "a" (but *is* the case if you use mode "w", since 342bc734f523).
re-opening since, though released, this behavior change broke existing code.
release managers need to decide.
What's to decide? If the new behavior is also broken, we should fix it. I'd like a fix in the next 3.5.
Anything breaking 2.7 should be reverted. I take it, the change from bpo-26293 should be as well as this one?
Yhg1s, the commit message for bpo-26293 included the sentence "Offsets in ZIP file now are relative to the start of the archive in conforming to the specification." Can someone point to the line in the spec that covers this?
https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
There is no explicit line, but offsets always are named relative: "relative offset of local header", "offset of start of central directory with respect to the starting disk number". This looks not pretty clear, but open source utilities (unzip, p7zip) support ZIP archives started not from the start of the file.
The spec isn't very explicit about it, yes, but it does say this:
4.4.16 relative offset of local header: (4 bytes)
This is the offset from the start of the first disk on
which this file appears, to where the local header should
be found.
"the start of the first disk" could be construed to mean "the start of the ZIP archive embedded in this file". However, if you consider the information that's available, the only way to make ZIP archives work in the face of ZIP64 and other extensions that add information between the end-of-central-directory record and the end of the central directory, it's obvious that you can't correctly handle ZIP archives that start at an arbitrary point in a file.
ZIP archives have both a 4-byte magic number at the start, and a central directory at the end. The end-of-central-directory record is the very last thing in the file, and it records both the offset of the start of the central directory and the size of the central directory. In absense of any ZIP extensions that add records between the end-of-central-directory record and the end of the central directory, you can use those to correct all offsets in the ZIP archive. But as soon as you add (for example) ZIP64 records, this no longer works: ZIP64 has an end-of-zip64-central-directory locator, and variable-sized end-of-zip64-central-directory record. The locator is fixed size right before the end-of-central-directory record and records the offset (from the start of the file) to the end-of-zip64-central-directory record, but *not* the size of that record or any other information you can use to determine the offset of the start of the archive in the file.
Only by assuming the central directory record comes right before the end-of-central-directory record, or assuming fixed sizes for the ZIP64 record, can you deal with ZIP archives with offsets not from the start of the file. This assumption is not only *not* guaranteed by the ZIP spec, it's explicitly invalidated by ZIP64's variable sized records, and possibly other extensions (like encryption, compression and digital signatures, although I don't remember if those actually affect this).
It's true that many ZIP tools try to deal with these kinds of archives, although they *do realise it's wrong and they usually *do warn about it. They still can't deal with it if it uses variable-sized ZIP64 features (other than trawling through the file looking for the 4-byte magic numbers).
Here's an example of code that breaks because of this: https://github.com/Yhg1s/zipfile-hacks. I tried to convince zipfile to create Zip64 files with extra fields (the variable-sized parts) but unfortunately the *cough* "design" of the zipfile module doesn't allow that -- feel free to ignore the force_zip64 parts of the script.
(I'm using two python installations I had laying around here; I could've used 2.7.12 vs 2.7.13 instead, and the results would be the same.)
# Python 2.7.12 -- so old behaviour % python create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py2-preamble-w.zip create_small_zip64.py % python create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py2-preamble-a.zip create_small_zip64.py
# Python 3.6.0+ -- after this change, so new behaviour % \~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode w --preamble '#!/usr/bin/python' py3-preamble-w.zip create_small_zip64.py % \~/python/installs/py36-opt/bin/python3 create_small_zip64.py -v --mode a --preamble '#!/usr/bin/python' py3-preamble-a.zip create_small_zip64.py
The old zipfiles are fine: % zip -T py2-preamble-w.zip test of py2-preamble-w.zip OK % zip -T py2-preamble-a.zip test of py2-preamble-a.zip OK
The new one using 'w' is also fine (as expected): % zip -T py3-preamble-w.zip test of py3-preamble-w.zip OK
The new one using 'a' is broken: % zip -T py3-preamble-a.zip warning [py3-preamble-a.zip]: 17 extra bytes at beginning or within zipfile (attempting to process anyway) test of py3-preamble-a.zip FAILED
zip error: Zip file invalid, could not spawn unzip, or wrong unzip (original files unmodified)
The 'unzip' tool does work, but it also prints a warning: % unzip -l py3-preamble-a.zip Archive: py3-preamble-a.zip warning [py3-preamble-a.zip]: 17 extra bytes at beginning or within zipfile (attempting to process anyway) Length Date Time Name --------- ---------- ----- ---- 4016 2017-05-03 14:23 create_small_zip64.py --------- ------- 4016 1 file
Whether other tools try to compensate for the error depends greatly on the tool; there's quite a few that don't.
For the record, we had two different bits of code that created zipfiles with preambles using mode='a', created by (at least) two different people. I don't think it's unreasonable to assume that if you have a file with existing data you don't want the ZipFile to overwrite, it should be using mode 'a' :P
... and yes, the changes to bpo-26293 should be reverted; the changes attached to this issue merely revert part of bpo-26293's changes.
Thank you for raising this again and for your research Thomas. I now see that my understanding of the specification was wrong. I'll revert changes to zipfile.
There is one question. If the zip file was concatenated to other file (and looks incorrect from the view of other tools), and we opened it in append mode and added new entries, should the central directory count offsets from the start of the file or from the start of embedded archive? In the first case offsets of first entries (which was in the original concatenated archive) in the central directory will be different from offsets in local headers. In the second case the zip file will look incorrect from the view of other tools (but it already looked incorrect before appending).
Well, what should the zipfile module do when you open a broken zipfile for appending in the first place? :) There are many ways in which it could be broken. I don't think a zipfile with incorrect offsets should be treated any differently.
(I don't know *how* the zipfile module should treat broken zipfiles, and I don't know how it treats them now -- but I think the most important thing is consistency and documenting the choice :P)
New changeset ef4c6ba169ff59215442fc4047d83f7a3d39bf39 by Benjamin Peterson in branch '2.7': Revert "Issue bpo-29094: Offsets in a ZIP file created with extern file object and modes" (bpo-1467) https://github.com/python/cpython/commit/ef4c6ba169ff59215442fc4047d83f7a3d39bf39
New changeset 3763ea865cee5bbabcce11cd577811135e0fc747 by Serhiy Storchaka in branch 'master': Revert bpo-26293 for zipfile breakage. See also bpo-29094. (bpo-1484) https://github.com/python/cpython/commit/3763ea865cee5bbabcce11cd577811135e0fc747
New changeset 70dc6a7a0b7f104d87512556fca242c2ca96a010 by Serhiy Storchaka in branch '3.6': [3.6] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). (bpo-1485) https://github.com/python/cpython/commit/70dc6a7a0b7f104d87512556fca242c2ca96a010
New changeset c8faabce6ef318f3b425c6defd846e274d61e2ef by Serhiy Storchaka in branch '3.5': [3.5] Revert bpo-26293 for zipfile breakage. See also bpo-29094. (GH-1484). (bpo-1486) https://github.com/python/cpython/commit/c8faabce6ef318f3b425c6defd846e274d61e2ef
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
Show more details
GitHub fields: ```python assignee = 'https://github.com/serhiy-storchaka' closed_at =
created_at =
labels = ['3.7', 'type-bug', 'library', 'release-blocker']
title = 'Regression in zipfile writing in 2.7.13'
updated_at =
user = 'https://github.com/peterebden'
```
bugs.python.org fields:
```python
activity =
actor = 'serhiy.storchaka'
assignee = 'serhiy.storchaka'
closed = True
closed_date =
closer = 'serhiy.storchaka'
components = ['Library (Lib)']
creation =
creator = 'Peter Ebden'
dependencies = []
files = ['46107']
hgrepos = []
issue_num = 29094
keywords = ['patch']
message_count = 21.0
messages = ['284172', '284421', '284423', '284424', '284425', '284432', '284436', '292803', '292826', '292827', '292828', '292838', '292842', '292896', '292897', '292914', '293025', '293047', '293154', '293156', '293158']
nosy_count = 9.0
nosy_names = ['twouters', 'gregory.p.smith', 'vstinner', 'larry', 'benjamin.peterson', 'ned.deily', 'python-dev', 'serhiy.storchaka', 'Peter Ebden']
pr_nums = ['1467', '1484', '1485', '1486']
priority = 'release blocker'
resolution = 'fixed'
stage = 'resolved'
status = 'closed'
superseder = None
type = 'behavior'
url = 'https://bugs.python.org/issue29094'
versions = ['Python 3.5', 'Python 3.6', 'Python 3.7']
```