Prefer to open zip files with unzip instead of 7z

gapan commented 11 years ago

7z and also versions of unzip older than 6.10b, don't support utf8 properly. Zip archives created on windows that have files with non-latin filenames show up as garbage.

Here's a sample zip archive that displays this problem: http://pnboy.pinguix.com/gapan/23-10-2012-b-fasi-eaep.zip

Same problem is with cbz files, which are actually zip files.

Removing/commenting out lines 527 and 530 from src/fr-command-7z.c fixes this and opens zip files with unzip, but it removes the mimetypes from 7z completely.

Ideally, priority should be given to unzip first and if unzip is not present, fall back to 7z, if 7z is installed.

unxed commented 4 years ago

@alkisg new patches for unzip and p7zip using your code snippet (thanks!) new_patches.zip Please test before I upload them to upstream.

btw, one more potential issue is Windows support (both p7zip and unzip are cross platform). Currently our code compiles only for *nix platforms, but oem charset issue may also happen on Windows. Unfortunately I have no Windows to check if it is actually so or some Windows specific code is used in those tools to process oem charset correctly.

UPD: Anyway LANG=el 7za l sample.zip is not working as it sets terminal charset to default ANSI_X3.4-1968, so both p7zip and unzip think that it is actual system charset and try to transcode file names to it (instead of UTF-8), so file names are ruined both in terminal and file system.

alkisg commented 4 years ago

I uploaded the packages in my PPA: https://launchpad.net/~alkisg/+archive/ubuntu/ppa/+packages

Launchpad seems rather busy these days, it took a whole day to compile the older patch. I'll try to test the new patches, but it might take me a while due to other issues. Thanks!

alkisg commented 4 years ago

I tested the new patches, they're fine. I uploaded them in my PPA. I also updated my gist a bit to account for the "@" sign in locales like el_GR@euro.UTF-8.

Note that Debian supports the following locales (/usr/share/i18n/SUPPORTED), that are not in our list:

aa_DJ aa_ER aa_ET agr_PE ak_GH am_ET an_ES anp_IN ar_IN ar_SD ar_SS as_IN ayc_PE az_IR bem_ZM ber_DZ ber_MA bhb_IN bho_IN bho_NP bi_VU bn_BD bn_IN bo_CN bo_IN brx_IN bs_BA byn_ER ca_AD ca_FR ca_IT ce_RU chr_US ckb_IQ cmn_TW crh_UA csb_PL cv_RU de_BE de_IT doi_IN dsb_DE dv_MV dz_BT el_CY en_AG en_BW en_DK en_HK en_IL en_IN en_NG en_SC en_SG en_ZM eo_US es_CU es_US eu_FR ff_SN fil_PH fur_IT fy_DE fy_NL gez_ER gez_ET gu_IN gv_GB hak_TW ha_NG hif_FJ hi_IN hne_IN hsb_DE ht_HT hy_AM ia_FR ig_NG ik_CA iu_CA kab_DZ ka_GE kl_GL km_KH kn_IN kok_IN ks_IN ku_TR lb_LU lg_UG li_BE lij_IT li_NL ln_CD lo_LA lzh_TW mag_IN mai_IN mai_NP mfe_MU mg_MG mhr_RU mi_NZ miq_NI mjw_IN ml_IN mni_IN mnw_MM mr_IN mt_MT my_MM nan_TW nds_DE nds_NL ne_NP nhn_MX niu_NU niu_NZ nl_AW nr_ZA nso_ZA oc_FR om_ET om_KE or_IN os_RU pa_IN pap_AW pap_CW pa_PK ps_AF quz_PE raj_IN ru_UA rw_RW sah_RU sa_IN sat_IN sc_IT sd_IN sd_PK se_NO sgs_LT shn_MM shs_CA sid_ET si_LK sm_WS so_DJ so_ET so_KE so_SO sq_MK sr_ME ss_ZA st_ZA sw_TZ szl_PL ta_IN ta_LK tcy_IN te_IN tg_TJ the_NP ti_ER ti_ET tig_ER tk_TM tl_PH tn_ZA to_TO tpi_PG tr_CY ts_ZA ug_CN unm_US ur_IN ve_ZA wae_CH wal_ET wo_SN xh_ZA yi_US yo_NG yue_HK yuw_PG zu_ZA

I couldn't come up with a nice locale => OEM list, the best documentation I've found so far were these two.

ghost commented 4 years ago

@alkisg

If we don't hear back from the p7zip developer in a reasonable timeframe, we can file disto-specific bug reports, and ask them to incorporate the patches.

The original p7zip repository seems to be abandoned after version 16.02.

The Links section on 7-Zip homepage now lists szcnick/p7zip as the new fork of the p7zip code. It has support for codecs not supported in upstream 7-Zip, such as Zstd and LZ4. The main developer said the fork's "17.02" release is based on 7-Zip 17.01. (szcnick/p7zip#18)

Another fork is 3rdpartyrepos/p7zip. The updatesFrom7Zip branch seems to be based on 16.04, while the developer is attempting to port p7zip to 17.00 in the 1604_to_1700 branch.

To sum up, it might be a good idea to seek a good p7zip fork or create a new p7zip fork, if we can find developers who are interested in maintaining p7zip. The same applies to Info-ZIP's ZIP and UNZIP programs as well - the latest releases were more than 10 years ago.

alkisg commented 4 years ago

lsar has no such requirement. When the correct encoding has been detected...

Autodetection is so unreliable that IMHO it shouldn't even be attempted:

$ lsar win10test.zip 
win10test.zip: Zip
Œâ¦ âšš¨˜¦ ¡œ £â¤¦¬.txt
$ lsar -e cp737 win10test.zip 
win10test.zip: Zip
Νέο έγγραφο κειμένου.txt

I.e. lsar autodetection fails and needs a terminal for me to pass the -e parameter, while the proposed p7zip patch would just work, as long as I'm opening .zips from the same locale. This is what Windows does, and it would make sense to do the same in Linux too.

If setting LANG= just for the p7zip process sounds problematic (why? overriding it just needs a /usr/local/bin/p7zip wrapper, and works for all GUIs), then maybe a second environment variable could be supported. For example, exporting OEMCP=cp737 could completely skip the locale-to-codepage logic and force the use of cp737 in p7zip, unzip etc. Does that sound OK?

Finally, regarding p7zip development, this is just 1 out of 1000 papercuts that affect Greek schools, and we're developing several upstream projects as well (LTSP, epoptes etc), so personally I can't invest too much time in it... I've uploaded patches packages to our PPA, and I can file a bug report against Debian, if noone else does, and then I'll just hope that someone else can push more for getting the patch accepted.

Thank you all!

alkisg commented 4 years ago

Encoding: Shift_JIS (57% confidence)

To me, a decompressing solution that has 57% or even 95% success rate feels completely unacceptable. It's a bomb; users would decompress .zip files and at some point they'd discover that some or all of their filenames have been damaged; and they'd have to start looking in all their disk for such possibly damaged filenames from weeks or months ago, because noone notified them that the software they use only works some times.

Facial recognition or OCR software is allowed to work "some times" by its nature, but not normal programs like archivers, file managers, calculators, editors etc. Users expect these to work consistently.

alkisg commented 4 years ago

To sum up:

If OEMCP is defined, we use it. This allows using any iconv encoding without relying in LANG. It does require manual configuration from the user or an UI option, and that's fine.
If not, we try to map the current locale to OEMCP. This allows users to unzip files from their locale without configuring anything, and it's the most important part. The requirement to "have the locale generated" doesn't matter in this case, as we're using the user's locale, so it's already generated.

Does everyone agree on these 2?

The (1) part can be easily implemented by adding the following lines to unxed's existing patch:

    oemcp = getenv("OEMCP");

    if (!oemcp) {
        oemcp = "CP437";
       ...(rest of the locale-to-oemcp code here)

I tested that and it works fine:

$ OEMCP=Shift_JIS 7za l nenngajyou-data.zip 
...
2014-10-28 09:43:06 ....A        21976        20162  nenngajyou-data/年賀状住所印刷テンプレート.odt
2014-10-28 09:26:48 ....A        80823        77876  nenngajyou-data/年賀状住所録.ods

ghost commented 4 years ago

Can you please post the full patch for p7zip? Thanks!

(I see that the updated version uses oemcp while the previous version uses oem_cp, so I am not sure whether I have patched correctly. Sorry for the inconvenience caused.)

alkisg commented 4 years ago

Let's wait for @unxed to upload whatever he thinks is best for a final version, as he's the one that developed the patch in the first place.

I used oemcp without the dash because that's what Windows calls it. E.g. google for "oemcp" including the quotes.

unxed commented 4 years ago

Sorry, guys, overloaded with tasks now.

One important thing: our p7zip patch was buggy.

Instead of

+  if (!isUtf8) {
+    const char *lc_to_cp_table[] = {

we should use

+  Byte hostOS = GetHostOS();
+  if (!isUtf8 && ((hostOS == NFileHeader::NHostOS::kFAT) || (hostOS == NFileHeader::NHostOS::kNTFS))) {
+    const char *lc_to_cp_table[] = {

The problem is MacOS X. It writes file names in UTF8, but does not sets UTF8 header flag. But we can assume that file names in archive are written in OEM code page by DOS and Windows archivers only, so we check platform flag in zip header and do not do code page conversion for archives created on OS X.

Sample attached. Folder in archive should be named "Штабик 2020", not "╨и╤В╨░╨▒╨╕╨║ 2020". 1.zip

unxed commented 4 years ago

Let's consider this p7zip patch version as 'relatively final': https://github.com/unxed/oemcp/blob/master/p7zip_oemcp_ZipItem.cpp.patch

Of course we still need to fix last two issues from here https://github.com/mate-desktop/engrampa/issues/5#issuecomment-648924235.

alkisg commented 4 years ago

@unxed, great, thank you,

There are two CPs for "az_AZ" in our table. Not sure how to distinguish...

I think we should use the most common one, and the users can override it with the OEMCP environment variable. For example, for Greece, there were actually many different encodings, not just cp737. But if some user managed to create a .zip with cp869, I wouldn't want p7zip to try to autodetect that (with a 50% success rate!), as it's very rare. I would want to manually set OEMCP=cp869. (edit: note that this isn't related to the Linux environment where p7zip runs, but to the Windows environment where the .zip was created, that's why I'm suggesting we shouldn't try to autodetect that)

iconv does not support CP720...

I think we should NOT try to workaround iconv shortcomings. Affected users should try to send a patch to iconv, which will make all programs that use iconv work, not just p7zip. Let's keep the code clean and patches where they belong.

Let's consider this p7zip patch version as 'relatively final'...

Could you please attach it to the upstream bug report?

I think we should close this engrampa issue now, and we should focus on the upstream bug reports, and on distribution-specific bug reports. I will file a bug report in Debian for this, and I think I should link mainly to the upstream bug, not to an engrampa issue, even though here's where most of the chat happened... So let's continue the chat there in the p7zip bug tracker. Many thanks to the MATE developers that hosted all this chat here.

unxed commented 4 years ago

we should use the most common one

Not sure how to find out most common ones, so let's wait for reports from people who are actually using that locales.

Could you please attach it to the upstream bug report?

Done :)

UPD: Updated unzip patch for OEMCP env variable support also: https://sourceforge.net/p/infozip/patches/29/

UPD#2: unzip patch in Debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=197427#70 p7zip patch in Debian https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=965126

UPD#3: Found older similar approach: https://github.com/vitlav/libnatspec It also uses CP data from Wine: https://github.com/vitlav/libnatspec/blob/master/lib/data/get_charset_data.h There are patches for unzip and p7zip using this lib: https://github.com/zip-i18n https://aur.archlinux.org/packages/unzip-natspec/ https://aur.archlinux.org/packages/p7zip-natspec/ ...and even ppa with builds for 18.04 https://launchpad.net/~spvkgn/+archive/ubuntu/p7zip-natspec

libnatspec was last updated in 2010. It has some pros and contras against my approach. Pros: 1) it differs uz_UZ and uz_UZ@cyrillic 2) it better supports ukrainian; contras: it does not support greek. There may be other differences as well. Seems libnatspec's data is not raw Wine dump, but is manually tuned somehow. Request for code page info update from recent Wine: https://github.com/vitlav/libnatspec/issues/3

UPD#4: Another interesting issue is how to create .zip files on linux with filenames readable both by windows and *nixes and preserve all UTF8 characters when possible. It's not that easy, still possible. We should write "0" to "HostOS" flag, "0" to UTF8 flag, write file names in OEM code page and also write UTF8 version of file names in additional 0x7075 field. This is exactly that winzip and winrar do. Unfortunately p7zip and info-zip do not mimic this behavior currently, so zip files generated by them have ruined filenames when opened on windows. Sample of correct multi-platform .zip with both OEM and UTF8 filenames versions (files inside should be named "абвгде" and "жзийкл"): winzip.zip Suggested zip-i18n developer to implement this logic: https://github.com/zip-i18n/p7zip/issues/1 https://github.com/zip-i18n/zip/issues/1

unxed commented 4 years ago

Patch accepted to szcnick's p7zip fork, hooray! @alkisg what about switching to this repo as source for your ppa builds?

https://github.com/szcnick/p7zip/commit/e56ea97d89eb0cd59603402496a8208238b3fda2

alkisg commented 4 years ago

@unxed, thank you for filing https://bugs.debian.org/965126! All affected users in Debian-based distributions should comment there, so that the patch gets accepted faster. For other distributions, similar bug reports should be filed.

Regarding szcnick's p7zip fork, that would be another bug report for Debian to switch to that, as it's more actively maintained. The PPA builds are a temporary solution until the patch is accepted in Debian; it's not a good idea to switch to a different source than Debian as then this will need to be maintained indefinitely.

sc0w commented 4 years ago

I think p7zip fork must change the name, and then the people can propose it to package for distros, if not I think it never succeed.

unxed commented 4 years ago

@sc0w Sounds logical. Maybe suggest it in their bug tracker? https://github.com/szcnick/p7zip/issues

unxed commented 3 years ago

One more linux tool now have smart .zip oem charset detection implemented. Btw, there is another table of locale to oem code page translation. Not sure if it is better then mine.

Also found one more bug in unzip. UPD: and one more.

unxed commented 3 years ago

Python's ZipFile is also suffering from charset problems. Wrote two issues to it: https://bugs.python.org/issue41928 (about supporting unicode name extra field 0x7075) https://bugs.python.org/issue41929 (about using system locale for oem code page selection)

unxed commented 3 years ago

Grand unified algorithm to read filenames from zip files correctly:

1) Do zip entry have «Unicode Path Extra Field» (0x7075)? Use it for file name. 2) Is Unicode flag (0x800) set in «Flags» Field of zip entry? Assume «Filename» Field is in UTF-8. 3) Do «HostOS» Field of zip entry have values of 0 (FAT) or 11 (NTFS)? Assume «Filename» Field is in OEM charset corresponding to system locale. 4) Assume «Filename» Field is in UTF-8.

Patched p7zip uses exactly this method, and is able to process all zip files in my test set correctly (my test set contains several zips generated by different packers on windows, macos, linux, and by online services). The same algorithm should be used in any zip unpacker wishing to process non-latin filenames as gently as possible.

Single line solution for Engrampa on Ubuntu 20.04+ amd64: wget https://github.com/unxed/oemcp/raw/master/p7zip-oemcp.deb && sudo dpkg -i p7zip-oemcp.deb

unxed commented 3 years ago

As distros does not seem to hurry up in fixing unzip, I found a workaround that can be used to mitigate the problem here and now. Introducing zipwrapper — a perl script wrapping around zip/unzip and solving all charset problems seamlessly.

Now we can just replace all "zip" and "unzip" strings in fr-command-zip.c to "zipwrapper zip" and "zipwrapper unzip", select zip/unzip by default for zip archives as said in this ticket's title, and have all charset problems solved out of the box without waiting for distro maintainers or asking users to install some .deb, add ppa and/or set some environment variables.

Feedback is appreciated :)

unxed commented 3 years ago

@jloqfjgk can you files write an issue here: https://github.com/unxed/oemcp/issues

unxed commented 3 years ago

Grand unified algorithm to read filenames from zip files correctly:

1. Do zip entry have «[Unicode Path Extra Field](https://pkware.cachefly.net/webdocs/casestudies/APPNOTE.TXT)» (0x7075)? Use it for file name.

2. Is Unicode flag (0x800) set in «Flags» Field of zip entry? Assume «Filename» Field is in UTF-8.

3. Do «HostOS» Field of zip entry have values of 0 (FAT) or 11 (NTFS)? Assume «Filename» Field is in OEM charset [corresponding to system locale](https://github.com/unxed/oemcp/blob/master/oemcp.txt).

4. Assume «Filename» Field is in UTF-8.

Wrote a Perl script demonstrating this logic. Reads .zip, shows files in it, for each file detects correct filename encoding, shows decoded filename, suggests command line switches for popular archivers (unzip, unar, bsdtar) to extract this file correctly (zip format allows different files inside archive to have names in different charsets). An essential tool for everyone who wants to figure out how to work correctly with file name encodings in zip archives.

https://github.com/unxed/oemcp/blob/master/ziplist

Usage: ziplist filename.zip [-p]

-p option is kind of self-testing: it will invoke all suggested archivers in "list files" mode to check if charset options were suggested correctly. Needs to have all three (unzip, unar, bsdtar) archivers installed for this mode to work right.

PS: Perl's Archive::Zip itself can't do this job right out of the box, but fortunately can be easily extended for that.

unxed commented 3 years ago

@alkisg can you please help with troubleshooting this issue: https://github.com/jinfeihan57/p7zip/issues/112

alkisg commented 3 years ago

@unxed, I tried to reproduce it but I wasn't able to. I tested in Ubuntu 20.04 with LANG=en_US.UTF-8 and the last p7zip version in https://github.com/jinfeihan57/p7zip/releases. Maybe it's related to the build options in Arch, but I don't have Arch...

unxed commented 3 years ago

Native 7zip linux port released, wow. Unfortunately, with the same charset bug https://sourceforge.net/p/sevenzip/discussion/45797/thread/cec5e63147/?page=1&limit=25#eaa7

unxed commented 1 year ago

Let's consider this p7zip patch version as 'relatively final': https://github.com/unxed/oemcp/blob/master/p7zip_oemcp_ZipItem.cpp.patch

Of course we still need to fix last two issues from here #5 (comment).

Or libnatspec can be used instead as it was relatively recently updated.

mate-desktop / engrampa

Prefer to open zip files with unzip instead of 7z #5