kyz / libmspack

A library for some loosely related Microsoft compression formats, CAB, CHM, HLP, LIT, KWAJ and SZDD.
https://www.cabextract.org.uk/libmspack/
169 stars 45 forks source link

locale-dependent pitfalls #15

Closed galtgendo closed 6 years ago

galtgendo commented 6 years ago

As long as the archive was produced in a way that stored the filenames as utf8 strings, things work fine.

Yet, if it's not the case, cabextract behaves as broken as zip does. Something like zip's iconv patch is needed - of course, the result would be requiring a libc that implements iconv. Also, if such archive was created is a DBCS locale, putting such name through tolower will only result in a completely broken return value.

galtgendo commented 6 years ago

While it would likely take a tailored archive, it looks like unix_path_seperators in DBCS locale can potentially return an incorrect result too.

galtgendo commented 6 years ago

a sort-of-working hack (I'm yet again reminded how horribly stupid this interface is)

Anyway, this is sort of works for me...really bad code though, also, in case of a partially completed conversion likely broken wrt. path separators.

galtgendo commented 6 years ago

updated hack ...a slightly less broken hack - separators should be now OK in case of partial conv

kyz commented 6 years ago

Thanks for reminding me of this. I actually have a well developed patch from Kevin Tardif for setting the (non-UTF8) filename encoding, but so far I haven't applied it for want of testing.

I'll look into applying it and testing it.

galtgendo commented 6 years ago

Looking forward to that patch. Just wonder how will you go about unix_path_seperators, cause heuristics: either they work or they get things even more broken.

kyz commented 6 years ago

If an encoding is specified, the CAB filenames that are not UTF8 will be converted from their given encoding to UTF-8 using iconv(), and the UTF8 flag will be set on them.

At that point, unix_path_separators() and create_output_name() will proceed as they would with UTF8 characters. Slash and backslash are single-byte characters in UTF8.

kyz commented 6 years ago

I've committed several changes to cabextract:

Could you try out these changes?

Iif you have any sample cabinet files with non-UTF8, non-ASCII, non-ISO-8859-1 filenames, I'd appreciate a copy. I haven't found any CAB creating software that generates these type of filenames.

If you know of any software that does create these, I'd also like to know about it too.

galtgendo commented 6 years ago
+   for (i = 0; i < (sizeof(locales)/sizeof(*locales)); i++) {
+      if (setlocale(LC_CTYPE, locales[i])) break;
+   }

This makes me a bit twitchy: as it's for purpose of tolower/towlower, I'm nearly sure there's at least one locale where that's incorrect...then again, same could be said about the current code. I've been shown once just why that monster called ICU is sometimes necessary ... I still groan about it.

Also, while you were already doing it in the code with utf8 names, I consider broken filenames a better solution than ones filled with replacement char, as the former at least gives you a minimal chance of recovering the original name (also, a lower chance of a name clash - pretty much only if the archive is corrupted).

Other than that, the code looks like something that should work, given that you're the author of libmspack too.

kyz commented 6 years ago

I think these are matters of opinion; the reason the replacement character is used is because 1) it's not useful to promise to convert to UTF8 and then generate invalid UTF8 (if you want the original filenames, just don't use --encoding) and 2) it helps pinpoint what went wrong, Filenames like br�k, br��k, b�r�k are much easier to debug than brk, brk, brk. It also removes the possibility of a bug where you end up with a blank filename.

All of the locales attempted with setlocale() are chosen for their UTF8 character type, only LC_CTYPE is set, cabextract doesn't use any other locale-dependent functions. It might not always be successful, because the user's system might not have any of those locales, but in that situation a POSIX compliant setlocale() will still leave the program in the "C" locale, so the situation didn't get any worse.

With glibc, all UTF8 locales have the same character types, except tr_TR which lowercases I to ı instead of i. I think trying to support that would introduce more problems than not supporting it. People usually work with cab files created by someone else, somewhere else, so adapting to their local settings is more likely to break things.

Talking of cabinet files, I still have still seen NO cabinet files in the wild that are non-ASCII and non-UTF8. If you have any, I would greatly appreciate them.

So far this feature is entirely theoretical. The Microsoft cab creation tools always encode non-ASCII filenames as UTF8, so I'm not sure if there are any cab files that would need this new --encoding option. I would love to be proven wrong.

kyz commented 6 years ago

Thanks for sending some cabinet files over email. cabextract 1.7 was just released with the --encoding option to override local and UTF8 encodings, and I added a test suite with some synthetic cab files using non-UTF8 non-ASCII filename encodings, to ensure the feature works. Thanks for your input!