python / cpython

The Python programming language
https://www.python.org
Other
63.38k stars 30.35k forks source link

ZipInfo corrupts file names in some old zip archives #84353

Closed 71d071f9-9473-4c46-8a5d-48cf8e75fd87 closed 1 year ago

71d071f9-9473-4c46-8a5d-48cf8e75fd87 commented 4 years ago
BPO 40172
Nosy @gpshead, @danifus, @yudilevi2, @iritkatriel
PRs
  • python/cpython#19335
  • Files
  • example.zip
  • Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.

    Show more details

    GitHub fields: ```python assignee = None closed_at = None created_at = labels = ['3.8', 'type-bug', 'library'] title = 'ZipInfo corrupts file names in some old zip archives' updated_at = user = 'https://github.com/yudilevi2' ``` bugs.python.org fields: ```python activity = actor = 'gregory.p.smith' assignee = 'none' closed = False closed_date = None closer = None components = ['Library (Lib)'] creation = creator = 'yudilevi' dependencies = [] files = ['49030'] hgrepos = [] issue_num = 40172 keywords = ['patch'] message_count = 8.0 messages = ['365701', '393766', '393767', '394293', '394505', '414601', '415741', '415745'] nosy_count = 4.0 nosy_names = ['gregory.p.smith', 'dhillier', 'yudilevi', 'iritkatriel'] pr_nums = ['19335'] priority = 'normal' resolution = None stage = 'patch review' status = 'open' superseder = None type = 'behavior' url = 'https://bugs.python.org/issue40172' versions = ['Python 3.8'] ```

    Linked PRs

    71d071f9-9473-4c46-8a5d-48cf8e75fd87 commented 4 years ago

    Some old zip files that don't yet use unicode file names might have entries with characters beyond the ascii range. ZipInfo seems to encode these file names with 'cp437' codepage (correct for old zips) but decode them back with 'ascii' code page which might corrupt them.

    iritkatriel commented 3 years ago

    Can you suggest a unit test for this?

    71d071f9-9473-4c46-8a5d-48cf8e75fd87 commented 3 years ago

    Hey :)

    Sorry that I'm not responsive, just busy. I'll add one soon.

    Yudi

    On Mon, May 17, 2021 at 12:08 AM Irit Katriel \report@bugs.python.org\ wrote:

    Irit Katriel \iritkatriel@yahoo.com\ added the comment:

    Can you suggest a unit test for this?

    ---------- nosy: +iritkatriel


    Python tracker \report@bugs.python.org\ \https://bugs.python.org/issue40172\


    e7453047-905b-492e-9222-15edd130d637 commented 3 years ago

    zipfile decodes filenames using cp437 or unicode and encodes using ascii or unicode. It seems like zipfile has a preference for writing filenames in unicode rather than cp437. Is zipfile's preference for writing filenames in unicode rather than cp437 intentional?

    Is the bug you're seeing related to using zipfile to open and rewrite old zips and not being able to open the rewritten files in an old program that doesn't support the unicode flag?

    We could address this two ways:

    I guess the choice will depend on if preferring unicode rather than cp437 is intentional and if writing filenames in cp437 will break anything (it shouldn't break anything according to Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)

    Here's a test for your current patch (I'd probably put it alongside OtherTests.test_read_after_write_unicode_filenames as this test was adapted from that one)

    class OtherTests(unittest.TestCase):
        ...
    
        def test_read_after_write_cp437_filenames(self):
            fname = 'test_cp437_é'
            with zipfile.ZipFile(TESTFN2, 'w') as zipfp:
                zipfp.writestr(fname, b'sample')
    
            with zipfile.ZipFile(TESTFN2) as zipfp:
                zinfo = zipfp.infolist()[0]
                # Ensure general purpose bit 11 (Language encoding flag
                # (EFS)) is unset to indicate the filename is not unicode
                self.assertFalse(zinfo.flag_bits & 0x800)
                self.assertEqual(zipfp.read(fname), b'sample')
    e7453047-905b-492e-9222-15edd130d637 commented 3 years ago

    Looking into this more and it appears that while Appendix D of https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT says "If general purpose bit 11 is unset, the file name and comment SHOULD conform to the original ZIP character encoding" where the original encoding is IBM 437 (cp437), this is not always followed. This isn't too surprising as cp437 doesn't have every character for every language! In particular, some archive programs on windows will use the user's locale code page.

    https://superuser.com/questions/1321371/proper-encoding-for-file-names-in-zip-archives-created-in-windows-and-unpacked-i

    A UTF filename can be stored in the extra field 0x7075 in addition to a filename encoded in an arbitrary code page stored in the header's filename section. There is a open issue to add handling these fields (for reading) to zipfile: https://bugs.python.org/issue41928 and that issue may be related to this one https://bugs.python.org/issue40407

    For this issue, with regards to encoding, I prefer the current situation where general purpose bit 11 for UTF is preferentially used because it doesn't change the behaviour compared to previous Python versions and it reduces file size as the filename isn't repeated in the extra field.

    For compatibility with other archive programs that don't support the general purpose bit 11, I suggest we add an additional mechanism to allow the code page for the path name (and comment) to be set and use the 0x7075 extra field to store the UTF name in those cases where the filename can't be encoded in ascii (and 0x6075 to store the utf comment where it can't be encoded in ascii)

    71d071f9-9473-4c46-8a5d-48cf8e75fd87 commented 2 years ago

    The main issue is that when extracting older zip files, files are actually written to disk with corrupted (altered) names. Unfortunately it's been a while since I saw this issue and I can't tell if it was fixed or if I simply can't reproduce it. I do see that encoding/decoding in ZipInfo is still inconsistent, sometimes uses ascii codepage and sometimes uses cp437 codepage which seems wrong to me. Not sure how we should handle it but I think that switching the default ascii encoding to cp437 to be consistent with the old implementation (and with the filename decoding) seems like the right way to go.

    e7453047-905b-492e-9222-15edd130d637 commented 2 years ago

    Related to issue https://bugs.python.org/issue28080 which has a patch that covers a bit of this issue

    gpshead commented 2 years ago

    Examining Lib/zipfile.py code, the existing code makes sense. Python's zipfile module produces modern zipfiles when writing by setting the utf-8 flag and storing the filename as utf-8 when it is not ASCII. This is desirable for use with all normal zip implementations in the past 10-15 years.

    When decoding a zipfile, if the utf-8 flag is not set, we assume cp437 per the pkware zip appnotes.txt "spec". So our reading is correct as well, even for very old files.

    This is being strict in what we produce an lenient in what we accept. caveats? yes:

    If someone does need to produce zipfiles for use with ancient software that does not support utf-8, that also does not identify the unknown utf-8 flag as an error condition, it will interpret the name in a corrupt manner for non-ascii names.

    Similarly, even if written with cp437 names (as PR 19335 would do), in old zip system implementations where the implementation blindly uses the users locale encoding instead of cp437, it will always see corrupt data in that scenario. (aka mojibake?)

    These are not what I'd expect to be normal use cases. Do you have a common practical example of a need for this?

    (The PR on bpo-28080 provides a way to _read_ legacy zip files that used a codec other than cp437 if you know what it was.)

    ---

    https://www.loc.gov/preservation/digital/formats/fdd/fdd000354.shtml may also be of interest regarding the zip format.

    distantvale commented 1 year ago

    The current implementation:

    def _encodeFilenameFlags(self):
        try:
            return self.filename.encode('ascii'), self.flag_bits
        except UnicodeEncodeError:
            return self.filename.encode('utf-8'), self.flag_bits | _MASK_UTF_FILENAME
    

    Corrupts old file names, and the problem isn't between ascii and utf-8, the problem is between ascii and cp437.

    The following implementation fixes that issue:

    def _encodeFilenameFlags(self):
        try:
            return self.filename.encode('cp437'), self.flag_bits
        except UnicodeEncodeError:
            return self.filename.encode('utf-8'), self.flag_bits | _MASK_UTF_FILENAME
    

    So it's a 1 liner change.

    Note that it's also consistent with the current implementation for reading the encoding:

    if flags & _MASK_UTF_FILENAME:
            # UTF-8 file names extension
            filename = filename.decode('utf-8')
        else:
            # Historical ZIP filename encoding
            filename = filename.decode(self.metadata_encoding or 'cp437')

    So it seems to be the proper way to handle it.

    distantvale commented 1 year ago

    Can anyone comment on this?

    arhadthedev commented 1 year ago

    @distantvale I've retriggered checks in gh-19335. Now it waits for the CLA, a NEWS entry, a test, and rebasing onto main instead of 3.9.

    Considering triviality of gh-19335, you can file your own pull request with all the required stuff. Or I can do it this weekend.

    distantvale commented 1 year ago

    hey @arhadthedev - I didn't get a chance to open the PR - did you make any progress or should I take over?

    arhadthedev commented 1 year ago

    or should I take over?

    I haven't got there yet so you're more than welcome to go ahead.

    gpshead commented 1 year ago

    yep a new PR with the one line fix will work (that old one was made against the wrong branch).

    arhadthedev commented 1 year ago

    Well, it's another time when I reacted to a so-simple-what-can-be-wrong-here PR forgetting to look into the associated issue...

    https://github.com/python/cpython/pull/19335#issuecomment-1541036518:

    this PR needs to be re-created by someone against the main branch.

    being unaware that it's the very issue from my deep backlog where I answered this:

    or should I take over?

    I haven't got there yet so you're more than welcome to go ahead.

    gpshead commented 1 year ago

    Since the older comment thread on this issue, https://github.com/python/cpython/issues/72267 (bpo-28080) was resolved in 3.11+ via PR https://github.com/python/cpython/pull/32007. A metadata_encoding= parameter was added to ZipFile for use when reading.

    The updated PR #104347 with the proposed one line "fix" may just return us to producing zip files of questionable clarity.

    It is important to note that ascii is strict subset of cp437. So nothing reading a zip file produced by the Python zipfile module should ever be confused by it's output. Unless it is so old as to not understand the UTF-8 flag. We don't need to cater to software that ancient.

    If we went forward with that PR #104347, we'd be re-creating the since fixed https://github.com/python/cpython/issues/95463 regression from 3.11 again.

    It still isn't clear to me what the original problem that this issue was filed for really is. What the zipfile module produces is correct. There is no corruption.

    The main issue is that when extracting older zip files, files are actually written to disk with corrupted (altered) names. Unfortunately it's been a while since I saw this issue and I can't tell if it was fixed or if I simply can't reproduce it.

    ... meaning we have no identifiable bug. No evidence of corrupt unintelligible incorrect metadata encoding zip files produced by Python zipfile module.

    I do see that encoding/decoding in ZipInfo is still inconsistent, sometimes uses ascii codepage and sometimes uses cp437 codepage which seems wrong to me.

    Intentionally so. ascii is a subset of both cp437 and utf-8 so it is always safe to emit. When non-ascii characters are found we emit utf-8 which carries a dedicated flag to indicate that. As zip files without the utf-8 flag set can have metadata written in any undeclared encoding, it is not correct for us to emit cp437 as there is no guarantee that a reader would decode using that undeclared (merely a "suggested default") codec.

    Not sure how we should handle it but I think that switching the default ascii encoding to cp437 to be consistent with the old implementation (and with the filename decoding) seems like the right way to go.

    Having read things over, I disagree. We're emitting the most consistently understandable zip files we can with the actual encoding declared.

    I'm going to close this as Not Planned as I haven't seen evidence that the Python zipfile module does anything wrong here. Feel free to reopen it if anyone can provide such concrete filename metadata encoding mixup test cases.

    distantvale commented 8 months ago

    @gpshead the main issue isn't supporting old software but rather not corrupting old zip files. The latest python zipfile implementation actually corrupts some old zip files that use special characters in file names, as the file names in these were written using cp437.

    When file names in old zip files have special characters and were created with cp437 encoding, there are actually 2 issues:

    1. When opening these for writing, the new implementation actually writes encodes file names in the CD as ascii:

      def _encodeFilenameFlags(self):
          try:
              return self.filename.encode('ascii'), self.flag_bits
          except UnicodeEncodeError:
              return self.filename.encode('utf-8'), self.flag_bits | _MASK_UTF_FILENAME

      but the local file header is left intact - still cp437 encoded, which results in a corrupt zip file (filename mismatch between CD and LC).

    2. if both were updated to ascii encoding, and when special characters were using in the file names, the new implementation actually modifies (and thus corrupts) these file names.

    I do believe internal zip data should remain untouched/consistent and that new software shouldn't attempt to "fix" old zip files unless explicitely requested to do so, especially if the correct fix is subjective.

    gpshead commented 8 months ago

    can you provide a small sample .zip file using cp437 non-ascii filenames and small python code snippet you are using to modify it that demonstrates that for use as a regression test?

    distantvale commented 7 months ago

    @gpshead sure, please see attached zip - it contains a very small test zip + a script that updates the comment in the test zip from "foo" to "bar". The test zip only contains a single empty file (the content is irrelevant). Note that prior to running the script, the file name in the local file header/CD is consistent. Once you run it, you'll notice that the comment is indeed updated, but the file name in the CD gets corrupted.

    name_bug.zip

    distantvale commented 7 months ago

    hey @gpshead did you get a chance to look into this?

    distantvale commented 4 months ago

    bump

    distantvale commented 2 months ago

    can any dev look at that? it's been a long time and the bug is clearly visible in the example I posted