See the attached sample. The well-known unzip command-line tool lists its contents correctly:
$ unzip -l 23.zip
Archive: 23.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
    81408  2012-10-23 19:03   Β' ΦΑΣΗ ΠΕ06 ΣΧΟΛΕΙΑ ΕΑΕΠ (ΙΝΤ).xls
---------                     -------
    81408                     1 file
But ZipFile lists the same file inside this archive as ü' öÇæå Åä06 æòÄèäêÇ äÇäÅ (êîÆ).xls
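The garbled name looks like the result of decoding the raw name bytes with the wrong code page. A minimal sketch of that effect follows. It assumes the archive stores the name in cp737 (DOS Greek) and that the reader decoded it as cp437. The cp737 assumption is inferred from the two listings above, not stated in this report, and `redecode` is a hypothetical helper, not part of `zipfile`.

```python
def redecode(name: str, wrong: str = "cp437", actual: str = "cp737") -> str:
    """Undo a wrong-code-page decode: recover the raw bytes, then decode again.

    Both code pages are assumptions made for this illustration.
    """
    return name.encode(wrong).decode(actual)


garbled = "ü' öÇæå Åä06 æòÄèäêÇ äÇäÅ (êîÆ).xls"
print(redecode(garbled))
# Β' ΦΑΣΗ ΠΕ06 ΣΧΟΛΕΙΑ ΕΑΕΠ (ΙΝΤ).xls
```

This only works when the original code page is known or guessed correctly, which is why a Unicode name stored in the archive itself is the more reliable source.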
ZipFile produces this name because it completely ignores the Unicode Path Extra Field (0x7075) in the ZIP header.
See the .ZIP specification for details on this field's meaning and usage: https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT
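As a rough illustration of what reading this field involves, here is a minimal sketch based on the field layout in the specification: a one-byte version, a CRC-32 of the standard header name, then the UTF-8 name. The function name `unicode_path_from_extra` is hypothetical, and this is not the code from the PR.

```python
import struct
import zlib
from typing import Optional

UNICODE_PATH_ID = 0x7075  # Info-ZIP Unicode Path Extra Field


def unicode_path_from_extra(extra: bytes, raw_name: bytes) -> Optional[str]:
    """Return the UTF-8 name from a 0x7075 extra field, or None if absent/invalid."""
    pos = 0
    # The extra area is a sequence of (header id, data size, data) records.
    while pos + 4 <= len(extra):
        header_id, size = struct.unpack_from("<HH", extra, pos)
        data = extra[pos + 4 : pos + 4 + size]
        pos += 4 + size
        if header_id != UNICODE_PATH_ID or len(data) != size or size < 5:
            continue
        version, name_crc = struct.unpack_from("<BL", data, 0)
        # The field counts only if its CRC matches the raw header name bytes;
        # a mismatch means the name was changed without updating the field.
        if version == 1 and name_crc == zlib.crc32(raw_name):
            try:
                return data[5:].decode("utf-8")
            except UnicodeDecodeError:
                return None
    return None
```

With `zipfile`, the raw extra bytes are available as `ZipInfo.extra`. Recovering the raw name bytes is assumed here to be possible via `info.orig_filename.encode("cp437")` when the UTF-8 flag (general purpose bit 11) is unset. Treat that as an assumption to verify against the Python version in use.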
Grand unified algorithm to read filenames from zip files correctly:
p7zip with the oemcp patch (https://github.com/unxed/oemcp/) uses exactly this method and correctly processes every zip file in my test set, which contains zips generated by several different packers on Windows, macOS, and Linux, as well as by online services. Any zip unpacker that wants to handle non-Latin filenames as gracefully as possible should use the same algorithm.
More than a month ago I submitted a PR that adds support for the Unicode Path Extra Field to ZipFile. The PR, https://github.com/python/cpython/pull/23736, is awaiting review before it can be merged.
Note: these values reflect the state of the issue at the time it was migrated and might not reflect the current state.
GitHub fields:
```python
assignee = None
closed_at = None
created_at =
labels = ['type-feature', 'library', '3.10']
title = 'ZipFile does not supports Unicode Path Extra Field (0x7075) zip header field'
updated_at =
user = 'https://bugs.python.org/ivansorokintech'
```
bugs.python.org fields:
```python
activity =
actor = 'andreaerdna'
assignee = 'none'
closed = False
closed_date = None
closer = None
components = ['Library (Lib)']
creation =
creator = 'ivan.sorokin.tech'
dependencies = []
files = ['49491']
hgrepos = []
issue_num = 41928
keywords = ['patch']
message_count = 3.0
messages = ['377931', '377945', '385467']
nosy_count = 2.0
nosy_names = ['ivan.sorokin.tech', 'andreaerdna']
pr_nums = ['23736']
priority = 'normal'
resolution = None
stage = 'patch review'
status = 'open'
superseder = None
type = 'enhancement'
url = 'https://bugs.python.org/issue41928'
versions = ['Python 3.10']
```