python / cpython

The Python programming language
https://www.python.org

ZipFile does not support the Unicode Path Extra Field (0x7075) zip header field #86094

Closed 0c49011a-ab29-4b67-907b-a43ebd6cc295 closed 1 year ago

0c49011a-ab29-4b67-907b-a43ebd6cc295 commented 4 years ago
BPO 41928
Nosy @agiudiceandrea
PRs
  • python/cpython#23736
Files
  • 23.zip

Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.


    GitHub fields:
    ```python
    assignee = None
    closed_at = None
    created_at =
    labels = ['type-feature', 'library', '3.10']
    title = 'ZipFile does not supports Unicode Path Extra Field (0x7075) zip header field'
    updated_at =
    user = 'https://bugs.python.org/ivansorokintech'
    ```

    bugs.python.org fields:
    ```python
    activity =
    actor = 'andreaerdna'
    assignee = 'none'
    closed = False
    closed_date = None
    closer = None
    components = ['Library (Lib)']
    creation =
    creator = 'ivan.sorokin.tech'
    dependencies = []
    files = ['49491']
    hgrepos = []
    issue_num = 41928
    keywords = ['patch']
    message_count = 3.0
    messages = ['377931', '377945', '385467']
    nosy_count = 2.0
    nosy_names = ['ivan.sorokin.tech', 'andreaerdna']
    pr_nums = ['23736']
    priority = 'normal'
    resolution = None
    stage = 'patch review'
    status = 'open'
    superseder = None
    type = 'enhancement'
    url = 'https://bugs.python.org/issue41928'
    versions = ['Python 3.10']
    ```


    0c49011a-ab29-4b67-907b-a43ebd6cc295 commented 4 years ago

    See the attached sample. The well-known unzip command-line tool lists its contents correctly:

    $ unzip -l 23.zip
    Archive:  23.zip
      Length      Date    Time    Name
    ---------  ---------- -----   ----
        81408  2012-10-23 19:03   Β' ΦΑΣΗ ΠΕ06 ΣΧΟΛΕΙΑ ΕΑΕΠ (ΙΝΤ).xls
    ---------                     -------
        81408                     1 file

    But ZipFile lists the same file inside this archive as ü' öÇæå Åä06 æòÄèäêÇ äÇäÅ (êîÆ).xls

    This is because ZipFile completely ignores the Unicode Path Extra Field (0x7075) in the zip header.
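A minimal sketch of how the two listings relate. It assumes the archive stores the name in the Greek OEM code page cp737 (an inference from the two outputs above, not something recorded in the archive) and relies on ZipFile decoding names as cp437 when the 0x800 flag is not set:

```python
# Name as ZipFile lists it: the raw Filename bytes decoded as cp437.
listed = "ü' öÇæå Åä06 æòÄèäêÇ äÇäÅ (êîÆ).xls"

# Undo the cp437 decoding to get the raw Filename bytes back, then
# decode them with the assumed Greek OEM code page instead.
raw_name = listed.encode("cp437")
recovered = raw_name.decode("cp737")  # cp737 is an assumption

print(recovered)
```

Under that assumption the recovered string should match the name unzip prints, which suggests the bytes in the regular Filename field are intact and only the choice of codec is wrong.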

    See the .ZIP specification for details on this field's meaning and usage: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
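For illustration, a rough sketch of reading this field from an entry's extra data. It follows the layout given in the specification (a one-byte version, a CRC-32 of the regular file name, then the UTF-8 name). The function name `unicode_path_from_extra` is hypothetical and is not part of zipfile or of the linked PR:

```python
import struct
import zlib

UNICODE_PATH_ID = 0x7075  # header ID of the Unicode Path Extra Field


def unicode_path_from_extra(extra, raw_name):
    """Return the UTF-8 name stored in a 0x7075 extra field, or None.

    extra    -- raw extra-field bytes of one entry (e.g. ZipInfo.extra)
    raw_name -- undecoded bytes of the entry's regular Filename field
    """
    pos = 0
    # The extra area is a sequence of (id, size, data) records, little-endian.
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4:pos + 4 + size]
        pos += 4 + size
        # Skip other records, truncated records, and records too short
        # to hold the version byte plus the CRC.
        if header_id != UNICODE_PATH_ID or len(data) != size or size < 5:
            continue
        version, name_crc = struct.unpack_from("<BI", data, 0)
        # The field only applies if its CRC matches the regular name.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            try:
                return data[5:].decode("utf-8")
            except UnicodeDecodeError:
                return None
    return None
```

The CRC check guards against a stale field left behind by a tool that renamed the entry without updating the extra data. How the caller obtains `raw_name` is left open here.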

    0c49011a-ab29-4b67-907b-a43ebd6cc295 commented 4 years ago

    Grand unified algorithm to read filenames from zip files correctly:

    1. Does the zip entry have a «Unicode Path Extra Field» (0x7075)? Use it for the file name.
    2. Is the Unicode flag (0x800) set in the «Flags» field of the zip entry? Assume the «Filename» field is in UTF-8.
    3. Does the «HostOS» field of the zip entry have a value of 0 (FAT) or 11 (NTFS)? Assume the «Filename» field is in the OEM charset corresponding to the system locale.
    4. Otherwise, assume the «Filename» field is in UTF-8.
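The four steps could be sketched roughly as follows. This is only an illustration of the decision order, not the code used by p7zip or by zipfile. The function name `decode_zip_name`, its parameters, and the `cp437` default are assumptions. In particular, the OEM code page for the system locale has to be supplied by the caller, since finding it is platform-specific.

```python
UTF8_FLAG = 0x800    # general purpose flag bit 11: name is UTF-8
OEM_HOSTS = {0, 11}  # HostOS values from step 3 (FAT, NTFS)


def decode_zip_name(raw_name, flags, host_os, unicode_path=None,
                    oem_codec="cp437"):
    """Decode the bytes of a zip entry's Filename field.

    unicode_path -- name already extracted from a 0x7075 extra field, if any
    oem_codec    -- OEM code page to assume for step 3 (hypothetical default)
    """
    # Step 1: a Unicode Path Extra Field takes precedence.
    if unicode_path is not None:
        return unicode_path
    # Step 2: the Unicode flag says the name is UTF-8.
    if flags & UTF8_FLAG:
        return raw_name.decode("utf-8", errors="replace")
    # Step 3: archives written on FAT/NTFS hosts use the OEM charset.
    if host_os in OEM_HOSTS:
        return raw_name.decode(oem_codec, errors="replace")
    # Step 4: everything else is assumed to be UTF-8.
    return raw_name.decode("utf-8", errors="replace")
```

For the attached sample, a call such as `decode_zip_name(raw, 0, 0, oem_codec="cp737")` would take the step 3 branch, assuming the entry has no extra field, no Unicode flag, and a Greek OEM code page.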

    p7zip with the oemcp patch (https://github.com/unxed/oemcp/) uses exactly this method and is able to process all zip files in my test set correctly (the set contains several zips generated by different packers on Windows, macOS, and Linux, and by online services). The same algorithm should be used by any zip unpacker that wants to handle non-Latin filenames as gracefully as possible.

    5da96943-6587-42f9-a33c-9f0eb4f94777 commented 3 years ago

    More than a month ago I submitted a PR that adds support for the Unicode Path Extra Field to ZipFile. The PR https://github.com/python/cpython/pull/23736 is awaiting review before it can be merged.