Prefer to open zip files with unzip instead of 7z

gapan commented 11 years ago

7z, and also versions of unzip older than 6.10b, don't support UTF-8 properly. Zip archives created on Windows that contain files with non-Latin filenames show up as garbage.

Here's a sample zip archive that displays this problem: http://pnboy.pinguix.com/gapan/23-10-2012-b-fasi-eaep.zip

The same problem occurs with cbz files, which are actually zip files.

Removing/commenting out lines 527 and 530 from src/fr-command-7z.c fixes this and opens zip files with unzip, but it removes the mimetypes from 7z completely.

Ideally, priority should be given to unzip first and if unzip is not present, fall back to 7z, if 7z is installed.
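The fallback order described above could be sketched in POSIX shell as follows. This is only an illustration of the idea: first_available is a hypothetical helper name, and engrampa itself selects its backends in C code, not with a script like this.

```sh
# Print the first tool from the argument list that is installed.
# Returns non-zero if none of them is found.
first_available() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            printf '%s\n' "$tool"
            return 0
        fi
    done
    return 1
}
```

For example, backend=$(first_available unzip 7z) would pick unzip when it is installed and fall back to 7z otherwise.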

stefano-k commented 11 years ago

I can confirm the issue

ghost commented 11 years ago

So that we have a record on this:

The problem you describe isn't a problem of 7z, it's a problem of whatever you used to create that zip file, and my humble interpretation of the sample file you showed on #mate @ freenode proves it.

If you extract the files with unzip and then zip them with 7z, everything works fine... which leads me to believe that the zip wasn't created on a machine which supports UTF-8, but instead any sort of iso-western (If you want I can tell you exactly which).

In other words, the problem is with whatever piece of trash that file was zipped with. There's no real problem and the performance of zip is worse than that of p7zip, so we're making a lot of users pay for a problem that was probably generated on a Windows machine :)

7z does support UTF-8 properly.

ghost commented 11 years ago

And why 'ideally' the priority to unzip/zip instead of 7z? Can you mention a single technical reason? Are you aware this will affect far more people than you?

stefano-k commented 11 years ago

We will give only the option to choose the desired backend, nothing more

gapan commented 11 years ago

@ketheriel, if the only thing that you want to do is to blame someone, then feel free to blame anyone you like.

Yes, it is the fault of Windows not supporting utf8 properly. But the fact remains that 99% of zip files are created on the Windows platform. *nix users prefer tarballs instead. And the fact remains that all zip files that were created on Windows and that include non-latin filenames display as garbage in linux as it is. I don't care about blaming anyone, I care about fixing this. Of course I am aware that this will affect more people than me. It will fix this nightmare for other people too, not just me.

This issue has been fixed in unzip 6.10b and I would like this fix to be carried into mate-file-archiver.

Ideally, it should prefer unzip over p7zip, because newer versions of unzip behave properly. The technical reason is that otherwise filenames show as garbage. If that's not enough for you, then I don't know what else could be.

Also, performance of unzip is better than p7zip, so your point doesn't stand there.

And p7zip doesn't support utf8 charsets the same way that older versions of unzip didn't. Here's a bug report that says the same: https://bugs.archlinux.org/task/18691

Also, here are bug reports from debian, ubuntu and gentoo about the utf8 mess with unzip:
http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=197427
https://bugs.launchpad.net/debian/+source/unzip/+bug/10979
https://bugs.launchpad.net/ubuntu/+source/unzip/+bug/203609
https://bugs.gentoo.org/show_bug.cgi?id=69945

ghost commented 11 years ago

Dude go dig some facts and leave the FUD; you are wrong since moment one, and one easy way to prove it is to make an archive in a UTF-8 environment and then if you extract it with p7z it works :)

You know, before I answered I tested all potential scenarios; that's the difference between you and me! At least I test stuff out before coming big and bad and trying to move people to ugly hacks that actually don't fix anything, instead they only sweep the trash under the rug. How can that help users ? :) Let me guess... now we tell others what software they should use ? :)

gapan commented 11 years ago

Hi dude. Troll much? Sorry, I won't bite.

mahiuchun commented 11 years ago

As I tried with unzip 6.0, it isn't even able to list file names in Unicode-enabled ZIP archives correctly. But 7z can handle Unicode-enabled ZIP archives with no problem.

The beta version of unzip may be better, but its developer said it has known issues and the date of the next release is unknown.

For non-Unicode enabled ZIP archives, lsar/unar is the best tool to use, as far as I can tell. lsar/unar has built-in support of auto encoding detection and encoding conversion. http://code.google.com/p/theunarchiver/

Yanpas commented 9 years ago

What about adding an option to the engrampa settings for which tool to use? By default it would be 7z, but you could easily switch to unzip.

gapan commented 8 years ago

Newer p7zip (at least version 15.14.1) autodetects the correct encoding and converts it to utf8 properly. So, I think this can be closed now.

Probably also fixes #102

Yanpas commented 8 years ago

Ubuntu 16.04 uses 9.20.1~dfsg.1-4.2, where did you manage to find this version?)

By the way, will engrampa provide a backend choice or not?

gapan commented 8 years ago

@Yanpas I compiled it myself. Sorry, I don't use ubuntu and I cannot give instructions for it.

But I realize that maybe I was too quick to close this. It works with the file I posted here and similar ones, but not others. Also, the issue had turned from "make unzip the default" to "provide the option to choose which unpacker to use". I'm reopening...

unxed commented 7 years ago

.debs for p7zip and p7zip-full 16.02 are here:

x32:
https://packages.debian.org/sid/i386/p7zip/download
https://packages.debian.org/sid/i386/p7zip-full/download

x64:
https://packages.debian.org/sid/amd64/p7zip/download
https://packages.debian.org/sid/amd64/p7zip-full/download

it works ok with the file attached above, but fails on my attached example.

unzip 6.0 works ok for me with both files. case2.zip

btw, does anyone know how to file a bug against p7zip about this?

Yanpas commented 7 years ago

@unxed probably here https://sourceforge.net/projects/p7zip/support?source=navbar

unxed commented 7 years ago

corresponding p7zip's issue: https://sourceforge.net/p/p7zip/bugs/187/

Btw, windows version of 7za 16.03 under wine handles my case2.zip correctly (but fails with 23-10-2012-b-fasi-eaep.zip, corresponding wine's issue: https://bugs.winehq.org/show_bug.cgi?id=41411). Same with 16.02 under wine. So the problem is seen only in linux build.

Upd: got an answer from p7zip:

It uses OEM (DOS) encoding. p7zip doesn't support it.

But even modern windows versions write .zip files with OEM-encoded filenames!

txtsd commented 7 years ago

How difficult would it be to put 7z at the end of the priority list for opening archives? Engrampa is extremely slow at opening regular archives when I have p7zip installed. If it's just a few lines of if-elses, I can try to look into it and submit a PR.

alkisg commented 7 years ago

There are some persons here claiming that this isn't an actual issue and that people should instead create proper utf-8 based .zip files etc. Here is an example that should prove how valid this issue is, i.e. just one of the ways users still create .zip files with OEM encoding.

I open a fully updated Greek Windows 10 installation. I right click on the desktop, select "New text file", and I get this filename: Νέο έγγραφο κειμένου.txt Then I right click on that file, select "Send to zip", and name the zip file "win10test.zip". I'm attaching it here: win10test.zip

Then I try to uncompress it with unzip/7za, and I get the following results:

7za l win10test.zip:
2017-05-22 08:03:38 ....A 0 0 â¦ â¨¦ ¡ £â¤¦¬.txt

unzip -l win10test.zip:
0 2017-05-22 08:03 Мтж тЪЪиШнж бЬагтджм.txt

unzip -l -O cp737 win10test.zip:
0 2017-05-22 08:03 Νέο έγγραφο κειμένου.txt

Obviously, only the last try is correct. And since we're not able to fix Windows, or train millions of Windows users not to use the embedded zipping tools, or replace the millions of .zip files out there that already use the OEM charset, we should fix it on our side.
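To see why the three listings differ, it may help to decode the same raw bytes under different code pages. The sketch below assumes the first three filename bytes are 0x8C 0xE2 0xA6 (inferred from the garbled listings above, not read from the archive itself) and that the local iconv supports CP737 and CP866; decode_name_as is a hypothetical helper name.

```sh
# Decode the assumed first three filename bytes (octal 214 342 246,
# i.e. 0x8C 0xE2 0xA6) from the given code page to UTF-8.
# $1 = source code page name as understood by iconv, e.g. CP737
decode_name_as() {
    printf '\214\342\246' | iconv -f "$1" -t UTF-8
}
```

With those assumptions, decode_name_as CP737 should print Νέο, while decode_name_as CP866 should print Мтж, which matches the start of the plain unzip -l listing. The bytes are identical in both cases; only the code page used to interpret them changes.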

So to make engrampa correctly open .zip files, I'm doing the following workarounds in my installations:
1) sudo chmod -x /usr/bin/7z /usr/bin/7za
2) Create a wrapper in /usr/local/bin/unzip, that runs export UNZIP="-O $charset"; export ZIPINFO="-O $charset" before exec'ing /usr/bin/unzip.
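A wrapper along these lines might look like the sketch below. It assumes an unzip build that accepts the -O option, as in the commands above. ZIP_CHARSET and REAL_UNZIP are hypothetical variable names introduced here for illustration, and cp737 is only the default that fits the Greek example.

```sh
# Run the real unzip with a default filename charset.
# ZIP_CHARSET (assumed name): code page to pass via -O, default cp737.
# REAL_UNZIP  (assumed name): path of the real binary, default /usr/bin/unzip.
run_unzip_with_charset() (
    charset=${ZIP_CHARSET:-cp737}
    # unzip reads default options from UNZIP, zipinfo mode from ZIPINFO.
    UNZIP="-O $charset"
    ZIPINFO="-O $charset"
    export UNZIP ZIPINFO
    # The function body is a subshell, so exec replaces only that subshell.
    exec "${REAL_UNZIP:-/usr/bin/unzip}" "$@"
)
```

In an actual wrapper script the last line of the file would simply be run_unzip_with_charset "$@", so that every argument engrampa passes is forwarded unchanged.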

It would be most useful if engrampa had some option that would allow me to more properly apply my workaround without having to chmod -x system files, for example an option to define the preferred order for the various zip tools that are installed.

TomaszGasior commented 7 years ago

I would like to use Unzip rather than 7zip. It would be nice if the user had the ability to change this with a dconf setting. If Engrampa uses Unzip while creating or unpacking an archive, the progress bar in the GUI is more accurate: it shows the actual number of files instead of a "please wait" message.

alkisg commented 4 years ago

Hi, this is still an issue with the MATE 1.24, p7zip 16.02, and Windows 10 v2004 right click > create zip. The p7zip developer has replied "use 7zip via wine instead", which I believe is a strong reason to prefer unzip over p7zip, but anyway, ...

I submitted a pull request that allows to configure engrampa to prefer unzip, by specifying the UNZIP="-O cp737" environment variable.

unxed commented 4 years ago

Why can't engrampa detect the appropriate DOS encoding from the system locale settings and set the environment variables automatically?

alkisg commented 4 years ago

That would be a patch for unzip, not for engrampa. There have been various efforts for that for more than 10 years, but none of them became mainstream enough to reach the distributions. But the "-O charset" environment variables are supported.

We only ask from engrampa to allow us to prefer unzip, at least temporarily; engrampa code shouldn't bother with encodings at all.

unxed commented 4 years ago

@alkisg thanks, maybe you can tell where unzip development is happening? Not ubuntu package, but unzip source code itself.

alkisg commented 4 years ago

Note that the Ubuntu link was just one of many; see a similar example for Arch Linux here, and I think most distros have something similar.

unxed, if you're planning to add support for codepage autodetection to unzip, then it would probably be best to add it to p7zip instead. Then engrampa wouldn't need to be fixed at all, it could just continue to prefer p7zip.

One way to quickly find the upstream for packages, is to visit their debian pages, and click on the "homepage" link to the right:

https://tracker.debian.org/pkg/unzip => http://www.info-zip.org/UnZip.html (currently seems down)
https://tracker.debian.org/pkg/p7zip => http://p7zip.sourceforge.ne

unxed commented 4 years ago

@alkisg thanks again, but info-zip.org seems to be down.

As for p7zip, https://sourceforge.net/p/p7zip/bugs/187/

Probably the p7zip developer doesn't think that this feature is very important, or it can be difficult to implement.

Should I hope the patch will be accepted if the developer is not even interested in the implementation?

alkisg commented 4 years ago

unxed, well, as I mentioned above, I had reported it to the p7zip mailing list before the p7zip 187 bug, and the developer ended up saying "use wine".

The unzip upstream status as you saw was even worse.

That's why this issue has been standing for 10+ years; if a good solution existed, or if upstream was more interested, we wouldn't need an engrampa option for this at all.

Nevertheless, if you manage to send a good small patch to p7zip, it might be accepted. You can ask there before you start working on this.

But this 3-line patch for engrampa that I proposed, will at least make things tolerable, without being intrusive. I propose that we focus on that.

unxed commented 4 years ago

Well, I am not sure a 3-line patch is even possible here. I looked inside wine's code to see how it detects OEM code pages, and it's something non-trivial.

Btw,

unzip -O cp`wineconsole --backend=curses chcp.com | grep -Eo 'page: [0-9]+' | grep -Eo '[0-9]+'` sample.zip

works fine (but you need the whole wine just to open archive, so weird...)

alkisg commented 4 years ago

The 3 line patch allows engrampa users to manually specify the encoding. It does not attempt to autodetect the charset. As the screenshots in my PR show, it works fine.

If I get e.g. a Russian cp866 encoded .zip, I'll need to manually specify the code page; I wouldn't want any autodetection to use my Greek cp737 one. So manual=necessary step 1, and autodetection=optional, heuristic, difficult step 2.

I fear that if we focus on autodetection, we'll just end up not having the manual method either.

unxed commented 4 years ago

The 3 line patch allows engrampa users to manually specify the encoding.

"good small patch to p7zip" I mean, sorry :) 3-line-patch to engrampa is ok! First manual step is ok also!

unxed commented 4 years ago

Wrote a small utility to implement OEM code page autodetection from system locale. Based on wine code. It's currently a shell script, but it can be easily converted to .c for use inside unzip or p7zip code.

unxed commented 4 years ago

Here is unzip from sourceforge patched for proper code page auto detection. Patch itself. Issue on sourceforge.

Build with make -f unix/Makefile generic in the root folder of the source tree.

unxed commented 4 years ago

p7zip with the similar OEM code page auto detection patch applied. Patch itself. Issue on sourceforge.

gapan commented 4 years ago

@unxed I'm not that certain that correlating system locale with file codepage is the right thing to do here. For example, my current locale is en_US.utf8, but I definitely want files encoded with CP737 to be unzipped properly. Similarly, even if my locale was el_GR.utf8, I would want files encoded with a Russian PC to extract properly in mine.

@alkisg your solution looks good to me.

On a side note, there is also the unarchiver, which detects the correct codepage most of the time, but not always. You can also manually set the codepage with it. I'm sure I have a patch for engrampa that adds support for it somewhere. I'll try to find it.

sc0w commented 4 years ago

@gapan

I'm sure I have a patch for engrampa that adds support for it somewhere. I'll try to find it.

you have the gsettings key unar-open-zip

gapan commented 4 years ago

@sc0w hah, that's nice. When was this introduced? I created my patch for an older version, many years ago, probably something like 1.16, or even 1.12. Good thing it's not needed now.

sc0w commented 4 years ago

@gapan since 1.22.0 with https://github.com/mate-desktop/engrampa/commit/c587ae127cf0d6fee6b7e8c21bbe1c903faac13c

unxed commented 4 years ago

unar option is great, thanks! works good with russian, but still fails with win10test.zip from here

tested encoding detection lib uchardet for possible use with unzip/p7zip

it works just the same: ok with russian, not ok with win10test.zip
same with enca
same with chardet

maybe libchardet will show better results, have not tried yet

there also is charset detector from ICU - one more solution to try

unxed commented 4 years ago

btw, how does windows internal zip processing work? if it creates non-UTF8 zip files (for sure it does), there should be some trick for them to be opened correctly on other windows systems

alkisg commented 4 years ago

I applaud unxed's approach.

Previously, people were trying to autodetect the .zip charset from the ord(c) of the characters in the .zip filenames, as certain bytes are only used in specific charsets. This is very hard, and it might be completely impossible in some cases, where the ord() values of the filename characters are valid in more than one charset. Autodetection is NOT the correct way to go.

Unxed just maps the system locale to the OEM charset. This is not autodetection, but it's what Windows does, and it's more than good enough. I had proposed such a solution for unzip years ago, but I had never made a patch for it.
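The idea of mapping a locale to an OEM code page can be sketched as below. This is not the code from the patches: oem_cp_for_locale is a hypothetical name, and the table is a tiny illustrative subset (Greek 737 and Russian 866 as mentioned in this thread, French 850 as the usual Western European OEM page, and 437 as the default).

```sh
# Map a locale name such as "el_GR.UTF-8" to an OEM code page number.
# Only a small illustrative subset of languages is listed here.
oem_cp_for_locale() (
    # Keep only the language part: "el_GR.UTF-8" -> "el", "ru" -> "ru".
    lang=${1%%[_.@]*}
    case $lang in
        el) echo 737 ;;
        ru) echo 866 ;;
        fr) echo 850 ;;
        *)  echo 437 ;;   # default, also used for empty or unknown input
    esac
)
```

On an unzip build that accepts -O, this could be combined as unzip -O "cp$(oem_cp_for_locale "$LANG")" archive.zip, which for LANG=el_GR.UTF-8 expands to unzip -O cp737 archive.zip.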

@gapan, if your system is en_US.UTF-8 and you want to uncompress a problematic .zip, you'd launch engrampa with the Greek locale:

LANG=el_GR.UTF-8 engrampa

To elaborate: most Greek users will use the Greek locale and will be able to properly unzip .zip files with unxed's solution without having to configure anything at all. A few Greek users will prefer the English locale, and these users will need to run engrampa with a Greek interface, to unzip Greek .zip files generated on Windows. This is completely normal in my opinion, but if we wanted to provide an engrampa option for that, we could have a dialog or gsetting "Run unzip/p7zip with this locale: el_GR.UTF-8".

@unxed, Windows has the same problem too; i.e. when I wanted to properly unzip a French .zip file, I had to temporarily change my Windows locale to French to be able to properly unzip it. Let's hope your solutions get accepted upstream! If so, there's no need to change engrampa at all.

So... at this point, I think we should focus on getting the p7zip patch accepted!

unxed commented 4 years ago

@gapan with my patches you can also use

alias 7z='LC_ALL=el_GR.UTF-8 7z'
alias 7z='LC_ALL=ru_RU.UTF-8 7z'

to process all .zip archives using the given locale. Remember that you have to run

sudo locale-gen el_GR.UTF-8
sudo locale-gen ru_RU.UTF-8

for this to work.

unxed commented 4 years ago

Maybe we can somehow force those patches to be added at the build stage when building unzip/p7zip packages for ubuntu? There are dozens of issues on launchpad discussing related problems and even a blueprint for unzip to detect the encoding of filenames.

alkisg commented 4 years ago

If we don't hear back from the p7zip developer in a reasonable timeframe, we can file distro-specific bug reports, and ask them to incorporate the patches.

The best is to file such an issue against Debian, and put the patch in debian/patches, so that it's applied on build, yes; then all derivative distributions like Ubuntu will get them.

unxed commented 4 years ago

@alkisg sounds good! If so, can I please ask you (and all other people here also :) to test patched p7zip on your system with different archives to make sure I didn't miss anything? I will do the same.

unxed commented 4 years ago

Built .deb with patched p7zip for testing purposes. NB, it will overwrite your existing p7zip installation, so install with sudo dpkg -i p7zip-oemcp.deb, uninstall with sudo apt remove p7zip-oemcp && sudo apt reinstall p7zip-full p7zip. This is amd64 package built on Mint 19.03 so it should work on Ubuntu 18.04 and later also.

Engrampa processes file names in OEM charset like a charm with this package installed on my system.

PS: p7zip patch contained two bugs, now they are fixed and everything is working perfectly. Sourceforge issue updated.

alkisg commented 4 years ago

I verify that the patch works fine in Ubuntu 20.04. I will upload patched p7zip debs in the Greek schools PPA in a couple of days.

I'll close the PR as this is a much better solution; let's keep this issue still open for coordinating things though! :)

Thanks again unxed!

unxed commented 4 years ago

It's an honor for me that my small piece of code will benefit education in Greece!

unxed commented 4 years ago

Maybe this issue should be renamed to something like "fix handling of non-utf8 file names inside .zip archives"?

alkisg commented 4 years ago

@unxed, do you think a function like the following would make the patches prettier and easier to accept?

https://gist.github.com/alkisg/13b6ead3fd0d3d827b1f52976b07057d

unxed commented 4 years ago

@alkisg that's better, of course! Still some issues:

1) If we want to do things right, why should we define CP437 three times? We can use lc_to_cp_table[91] instead.
2) A good idea is to add a check for empty input, like if (lc_len == 0). Without such a check lc_to_cp() returns the first cp, 850, instead of the default 437 if the input is empty.
3) Not sure if we should interpret partial locale values like "ru". This probably means something went wrong outside of lc_to_cp(), so a better idea is to fall back to the default encoding in that case. So we need an additional condition like (lc_len == strlen(lc_to_cp_table[i])).
4) Adding functional code to the object-oriented code of p7zip looks weird, that's why I wrote it as inline code, not a function. But your code can easily be adapted for inline use :)

Also I found several issues with this CP detection method itself:
1) There are two CPs for "az_AZ" in our table. Not sure how to distinguish Azeri Cyrillic from Azeri Latin for those locales. Same with "sr_RS" and "uz_UZ". Maybe we should parse some environment variable to select Cyrillic or Latin in those cases? Also static analysis (as it is done in charset detection libs) has some chance of distinguishing Latin from Cyrillic.
2) iconv does not support CP720, so we should implement the transcoding ourselves to support Arabic languages. It should not be very hard. cp720 chars are

\x80\x81éâ\x84à\x86çêëèïî\x8d\x8e\x8f\x90\u0651\u0652ô¤ـûùءآأؤ£إئابةتثجحخدذرزسشص«»░▒▓│┤╡╢╖╕╣║╗╝╜╛┐└┴┬├─┼╞╟╚╔╩╦╠═╬╧╨╤╥╙╘╒╓╫╪┘┌█▄▌▐▀ضطظعغفµقكلمنهوىي≡\u064b\u064c\u064d\u064e\u064f\u0650≈°∙·√ⁿ²■\u00a0

(from https://github.com/ashtuchkin/iconv-lite/pull/221) so we can just use (character code - 128) as index in that string to get utf-8 char

unxed commented 4 years ago

Tested with Serbian. Locales for Cyrillic and Latin are different. For Cyrillic, it's sr_RS. For Latin:

LANG=sr_RS.UTF-8@latin
LANGUAGE=sr_RS
LC_CTYPE="sr_RS.UTF-8@latin"

But I can't find different Latin/Cyrillic locales for Azeri and Uzbek.
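The Serbian example above shows that the script can be carried in the @modifier part of the locale name. A minimal sketch of pulling a locale name apart in POSIX shell, assuming the usual language[_territory][.codeset][@modifier] layout; split_locale is a hypothetical name and is not taken from the patches discussed here:

```sh
# Split a locale name into its parts and print them on one line.
# Missing parts are printed as empty values.
split_locale() (
    loc=$1 modifier= codeset= territory=
    # Strip the parts from right to left: @modifier, .codeset, _territory.
    case $loc in *@*) modifier=${loc#*@}; loc=${loc%%@*} ;; esac
    case $loc in *.*) codeset=${loc#*.}; loc=${loc%%.*} ;; esac
    case $loc in *_*) territory=${loc#*_}; loc=${loc%%_*} ;; esac
    printf 'language=%s territory=%s codeset=%s modifier=%s\n' \
        "$loc" "$territory" "$codeset" "$modifier"
)
```

For sr_RS.UTF-8@latin this prints language=sr territory=RS codeset=UTF-8 modifier=latin, so a code page lookup could in principle branch on the modifier where a locale provides one. It does not help for locales that carry no such modifier.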

alkisg commented 4 years ago

About "ru", many people define LANG=el because they get the wrong impression from LANGUAGE=el. If this value is in the environment even by mistake, there's no harm in using it. That's why I'm only checking for lc_len and not for all of the string.

About CP720, I think that should be a bug against iconv, I don't think it should be addressed by this issue here...

About CP437 two times, I restructured it so that it's only used one time. I don't think saving a few bytes is worth it though, readability and maintainability are more important.

Good work!

mate-desktop / engrampa

Prefer to open zip files with unzip instead of 7z #5