rikyoz / bit7z

A C++ static library offering a clean and simple interface to the 7-zip shared libraries.
https://rikyoz.github.io/bit7z
Mozilla Public License 2.0

[Bug]: Some zip file names are garbled #207

Closed by lz990377023 4 months ago

lz990377023 commented 5 months ago

bit7z version

4.0.x

Compilation options

BIT7Z_7ZIP_VERSION

7-zip version

v23.01

7-zip shared library used

7z.dll / 7z.so

Compilers

Clang

Compiler versions

No response

Architecture

x86_64

Operating system

macOS

Operating system versions

No response

Bug description

[Screenshot 2024-05-04 13 27 52] [Screenshot 2024-05-04 13 33 17]

Hello, I used bit7z on a Mac to decompress a zip archive and found that some Chinese file names in the package are garbled. I don't know what the cause is. Is there a problem with the string decoding settings?

Steps to reproduce

No response

Expected behavior

No response

Relevant compilation output

No response


rikyoz commented 5 months ago

Hi!

is there a problem with the string decoding Settings

As far as I know, there shouldn't be any problem. 7-Zip always uses wide strings internally, and that bstrValue is exactly the wide string that 7-Zip reports to bit7z, without any modification. On Linux and macOS, bit7z then needs to convert these wide strings into narrow strings; to do this, the library uses the C++ standard library's facility for narrow string conversion, i.e.:

std::wstring_convert< std::codecvt_utf8< wchar_t >, wchar_t > converter;
return converter.to_bytes( wideString, wideString + size ); // std::wstring to std::string

So I don't think that the problem is on bit7z's side, but I'll try to investigate it anyway! By the way, what is the expected string value in those screenshots?
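For reference, here is a minimal, self-contained sketch of the conversion described above. The helper name narrowToUtf8 is hypothetical and not part of bit7z's API, and the sketch assumes that wchar_t holds one Unicode code point per element, as is typical on Linux and macOS. Note that std::wstring_convert and std::codecvt_utf8 are deprecated since C++17, although they are still widely available.

```cpp
#include <codecvt> // std::codecvt_utf8 (deprecated since C++17)
#include <locale>  // std::wstring_convert
#include <string>

// Hypothetical helper mirroring the conversion quoted above.
// Assumes wchar_t stores one Unicode code point per element (UTF-32).
std::string narrowToUtf8( const std::wstring& wide ) {
    std::wstring_convert< std::codecvt_utf8< wchar_t >, wchar_t > converter;
    // to_bytes throws std::range_error if the input cannot be converted.
    return converter.to_bytes( wide );
}
```

If the wide string already contains the wrong code points, this conversion faithfully encodes the wrong characters, which is why converting again on the caller's side cannot repair a garbled name.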

lz990377023 commented 5 months ago

Thanks for your answer. I have tried converting the wstring to a string, but it is still garbled: [Screenshot 2024-05-05 13 54 21] The correct text is shown in the picture below: [Screenshot 2024-05-05 13 55 59]

rikyoz commented 5 months ago

I have tried to convert wstring to string, but it is still garbled

Sorry, I meant that bit7z already performs the conversion, so there's no need to do a string -> wstring -> string conversion.

Anyway, I did some tests, and I was able to replicate the issue. It seems to happen only with Zip archives created using the native compression tool of macOS (the one from the right-click context menu): [image] [image]

If, however, the Zip archive is created via 7-Zip's CLI 7zz, the name of the item is correctly decoded: [image] [image]

As I said, I don't think this is a bug on bit7z's side: it is 7z.so that reports the item name differently in these cases, for some reason.

However, the 7-Zip CLI seems to perform some further string decoding compared to the shared library, as the 7zz tool always displays the correct name: [image]

I'll investigate what decoding is performed by 7zz and try to implement it also in bit7z.

lz990377023 commented 5 months ago

Thanks again for your reply. I will also try to look at how 7zz handles this, and I look forward to the string decoding being updated in bit7z.

rikyoz commented 4 months ago

So, I investigated the problem further, and I found that this is a known issue with Zip archives created with the compression tool of macOS:

https://github.com/weichsel/ZIPFoundation/issues/63
https://github.com/gildas-lormeau/zip.js/issues/131
https://stackoverflow.com/questions/13261347/correctly-decoding-zip-entry-file-names-cp437-utf-8-or

In short, the macOS zip utility uses UTF-8 for filenames, but it doesn't set the UTF-8 flag (bit 11 of the general purpose bit flag) in the Zip file.
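To make the flag concrete, here is a rough sketch of how one could check it in a raw Zip local file header. The function name declaresUtf8Names is hypothetical, and this is neither how 7-Zip nor how bit7z inspects archives; it only relies on the Zip format's layout, where the two-byte general purpose bit flag follows the four-byte signature and the two-byte "version needed" field, and bit 11 signals UTF-8 names.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: reports whether a Zip local file header declares
// UTF-8 file names. `header` must start at the header's signature.
// Returns false if the buffer is too short or is not a local file header.
bool declaresUtf8Names( const std::vector< std::uint8_t >& header ) {
    constexpr std::size_t flagsOffset = 6;        // signature (4 bytes) + version needed (2 bytes)
    constexpr std::uint16_t utf8Flag = 1u << 11;  // bit 11: language encoding flag
    if ( header.size() < flagsOffset + 2 ) {
        return false;
    }
    // The local file header signature is the byte sequence 'P' 'K' 0x03 0x04.
    if ( header[ 0 ] != 0x50 || header[ 1 ] != 0x4B || header[ 2 ] != 0x03 || header[ 3 ] != 0x04 ) {
        return false;
    }
    // Multi-byte fields in the Zip format are stored little-endian.
    const auto flags = static_cast< std::uint16_t >( header[ flagsOffset ] | ( header[ flagsOffset + 1 ] << 8 ) );
    return ( flags & utf8Flag ) != 0;
}
```

For an archive produced by the macOS tool, a check like this would be expected to return false even though the name bytes are UTF-8, which is exactly the mismatch described above.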

I've found a possible fix to make 7-Zip correctly handle such Zip archives: use a UTF-8 locale.

// Before calling .path() or .name() on the item object
std::locale::global(std::locale("en_US.UTF-8"));
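Since std::locale's constructor throws std::runtime_error when the named locale is not available on the system, a slightly more defensive variant of the same idea could look like the sketch below. The helper name setFirstAvailableLocale is hypothetical, and which locale names actually exist depends on the machine.

```cpp
#include <locale>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: installs the first available locale from `candidates`
// as the global locale and returns its name, or an empty string if none of
// the candidates exists on this system.
std::string setFirstAvailableLocale( const std::vector< std::string >& candidates ) {
    for ( const auto& name : candidates ) {
        try {
            std::locale::global( std::locale( name ) ); // throws if `name` is unknown
            return name;
        } catch ( const std::runtime_error& ) {
            // This locale is not available here; try the next candidate.
        }
    }
    return {};
}
```

It could be called once at startup, for example as setFirstAvailableLocale( { "en_US.UTF-8" } ), checking for an empty result before relying on .path() or .name().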

[image]

My guess is that, since 7-Zip does not see the UTF-8 flag, it interprets the filenames using the current locale's encoding, which may not be UTF-8. 7-Zip uses wide strings internally, and these are usually UTF-32 encoded on macOS/Linux, so it converts from the locale's encoding (possibly not UTF-8) to UTF-32. Because the original encoding was actually UTF-8, the result is garbled. The 7zz tool solves this by setting the locale to en_US.UTF-8, without any special string decoding as I originally thought: it simply converts from UTF-8 to UTF-32.
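The effect can be illustrated with a small sketch that decodes the same bytes in two ways. Both function names are hypothetical, Latin-1 is used only as a stand-in for "some non-UTF-8 locale encoding" (an assumption, since the actual encoding depends on the locale), and the UTF-8 decoder handles well-formed input only. This is not 7-Zip's code.

```cpp
#include <cstddef>
#include <string>

// Decodes `bytes` as if every byte were one Latin-1 character.
// Latin-1 is only a stand-in for a non-UTF-8 locale encoding.
std::u32string decodeAsLatin1( const std::string& bytes ) {
    std::u32string out;
    for ( unsigned char b : bytes ) {
        out.push_back( static_cast< char32_t >( b ) );
    }
    return out;
}

// Minimal UTF-8 to UTF-32 decoder for well-formed input (no validation).
std::u32string decodeAsUtf8( const std::string& bytes ) {
    std::u32string out;
    std::size_t i = 0;
    while ( i < bytes.size() ) {
        const auto lead = static_cast< unsigned char >( bytes[ i ] );
        std::size_t extra = 0;      // number of continuation bytes
        char32_t codePoint = lead;  // single-byte (ASCII) case
        if ( lead >= 0xF0 ) {
            extra = 3;
            codePoint = lead & 0x07;
        } else if ( lead >= 0xE0 ) {
            extra = 2;
            codePoint = lead & 0x0F;
        } else if ( lead >= 0xC0 ) {
            extra = 1;
            codePoint = lead & 0x1F;
        }
        // Each continuation byte contributes its low 6 bits.
        for ( std::size_t k = 1; k <= extra && i + k < bytes.size(); ++k ) {
            const auto next = static_cast< unsigned char >( bytes[ i + k ] );
            codePoint = ( codePoint << 6 ) | ( next & 0x3F );
        }
        out.push_back( codePoint );
        i += extra + 1;
    }
    return out;
}
```

With the three UTF-8 bytes E4 B8 AD, decodeAsUtf8 yields the single code point U+4E2D, while decodeAsLatin1 yields three unrelated characters (U+00E4, U+00B8, U+00AD), which is the kind of garbling seen in the item names.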

Unfortunately, I don't think there is a clean workaround that can be implemented within bit7z.