rikyoz / bit7z

A C++ static library offering a clean and simple interface to the 7-zip shared libraries.
https://rikyoz.github.io/bit7z
Mozilla Public License 2.0
602 stars 110 forks source link

[Bug]: bit7z doesn't support UTF-8 for zip filenames under Windows #171

Closed ra-dave closed 3 months ago

ra-dave commented 9 months ago

bit7z version

4.0

Compilation options

No response

7-zip version

v22.01

7-zip shared library used

7z.dll / 7z.so

Compilers

MSVC

Compiler versions

No response

Architecture

x86_64

Operating system

Windows

Operating system versions

Windows 10

Bug description

On Windows, bit7z errors when the zip filename has UTF-8 characters. The std::exception's ex.what() is:

No mapping for the Unicode character exists in the target multi-byte code page.

Steps to reproduce

Just try to create a zip file that has UTF-8 characters in the filename.

No response

Expected behavior

No response

Relevant compilation output

No response

Code of Conduct

rikyoz commented 9 months ago

Hi! Thanks for reporting the issue! Which version of MSVC are you using? I couldn't reproduce your issue, but I only tested with recent versions of MSVC.

However, while troubleshooting the problem, I found a few lines of code where path encoding wasn't correctly handled. It might be related to your issue.

I just pushed a commit to the develop branch that should fix those lines (d789fa664cefc094cc9b869a30ba6292b6d0961b). Hopefully, it should fix your issue, too.

ra-dave commented 9 months ago

Appreciate the quick response! I'm using VS 2022.

Just tried the new code, still getting the same crash with BitFileCompressor compress().

BitFileExtractor extract() can't handle a UTF-8 character in the zip filename either. That is the more critical one for me at the moment because I can workaround the BitFileCompressor side if I need to by creating a zip file with a temporary filename and then renaming it outside of bit7z. But trying to do something similar on the extract side is much more problematic (might not have permissions to rename the zip file, etc).

This is where it is crashing when I try extract():

image

image

My build settings:

C:\bit7z\build>cmake ../ -DCMAKE_BUILD_TYPE=Debug -- Selecting Windows SDK version 10.0.19041.0 to target Windows 10.0.19045. -- Standard filesystem: YES -- Target Version: 4.0.0 -- Compiler ID: MSVC -- Compiler Version: 19.35.32215.0 -- Architecture: x64 -- Build Type: Debug -- Language Standard for bit7z: C++17 -- Auto format detection: OFF -- Regex matching extraction: OFF -- Use std::byte: OFF -- Use native string: OFF -- Generate Position Independent Code: OFF -- Disable Zip ASCII password check: OFF -- 7-zip version: 23.01 -- Build tests: OFF -- Build docs: OFF -- Auto prefix long paths: OFF -- Use the default codepage: OFF -- Path sanitization: OFF -- Code analysis: OFF -- Static runtime: OFF -- CPM: Adding package 7-zip@23.01 (v23.01) -- 7-zip source code available at C:/bit7z/build/_deps/7-zip-src -- Configuring done -- Generating done -- Build files have been written to: C:/bit7z/build

rikyoz commented 9 months ago

I'm using VS 2022. Uhm, this is really strange, as I can't replicate the issue...

Just tried the new code, still getting the same crash with BitFileCompressor compress().

I just pushed some other minor fixes that should make the encoding handling even more robust. However, at this point, I don't think the problem is in bit7z...

BitFileExtractor extract() can't handle a UTF-8 character in the zip filename either. That is the more critical one for me at the moment because I can workaround the BitFileCompressor side if I need to by creating a zip file with a temporary filename and then renaming it outside of bit7z. But trying to do something similar on the extract side is much more problematic (might not have permissions to rename the zip file, etc).

I see. Do you have this issue only with zip files or also with other formats like 7z?

This is where it is crashing when I try extract():

Interesting! Bit7z internally uses MSVC's implementation of std::filesystem. In this particular case, it uses the function std::filesystem::u8path as it expects to construct a path from a UTF-8 string.

Now, MSVC's fs::path internally uses wide strings, so it needs to convert a UTF-8 string to a wide string.

The error message No mapping for the Unicode character exists in the target multi-byte code page. makes me think that the error is that the original string you pass to bit7z is not actually UTF-8 encoded and hence contains some invalid UTF-8 sequences, causing the exception.

One possible way to avoid the throwing of the exception would be to use bit7z's own string conversion functions like this:

   return fs::path{ bit7z::widen( str ) }; // instead of fs::u8path

However, if the original string is not UTF-8 encoded, this might not result in a correct path encoding.

Anyway, thank you for all the details you provided! 🙏

ra-dave commented 9 months ago

My zip filename seems to be properly UTF-8 encoded (based on the fact that it appears in the debugger's watch window correctly when I use ",s8" modifier.

I see that you have some automated tests for testing files in the archive that are UNICODE characters but are there any tests for testing the actual archive filename itself? (for Windows)

rikyoz commented 9 months ago

My zip filename seems to be properly UTF-8 encoded (based on the fact that it appears in the debugger's watch window correctly when I use ",s8" modifier.

Uhm ok, strange! I'll need to investigate this further, even though without replicating is a bit difficult to understand what the actual problem is!

I see that you have some automated tests for testing files in the archive that are UNICODE characters but are there any tests for testing the actual archive filename itself? (for Windows)

Actually, yeah, here and here the automated tests do exactly that: reading an archive with Unicode characters in its name!

ra-dave commented 9 months ago

Uhm ok, strange! I'll need to investigate this further, even though without replicating is a bit difficult to understand what the actual problem is!

It's easy to replicate. Just use a UTF-8 encoded std::string for the archive filename, in Windows.

rikyoz commented 9 months ago

It's easy to replicate. Just use a UTF-8 encoded std::string for the archive filename, in Windows.

Yeah, the problem is that when I try to replicate it, it works!

Just for context, I'm testing everything in a clean installation of Windows 10 inside a VM with the default regional and language settings:

The only things installed are 7-zip, Visual Studio 2022, CMake, and Git. I cloned bit7z in the Documents folder, and created a CMake project in a separate folder called bit7z-test:

cmake_minimum_required( VERSION 3.8.2 FATAL_ERROR )
project( bit7ztest )

set( CMAKE_CXX_STANDARD 11 )
set( CMAKE_CXX_STANDARD_REQUIRED ON )

add_executable( bit7ztest main.cpp )

add_subdirectory( ${PROJECT_SOURCE_DIR}/../bit7z/ ${CMAKE_CURRENT_BINARY_DIR}/bit7z )
target_link_libraries( bit7ztest PRIVATE bit7z )
target_compile_options( bit7ztest PRIVATE /utf-8 )

The test files (to be compressed) are in the test_unicode folder in the project's folder:

I tested the compression, and it works just fine:

#include <iostream>

#include <direct.h>

#include <bit7z/bit7zlibrary.hpp>
#include <bit7z/bitexception.hpp>
#include <bit7z/bitfilecompressor.hpp>

using namespace bit7z;

auto main() -> int {
    std::cout << "Current Code Page: " << GetACP() << std::endl;
    _chdir(R"(C:\Users\Riccardo\Documents\bit7z-test\)");

    try {
        Bit7zLibrary lib{ R"(C:\Program Files\7-Zip\7z.dll)" };
        BitFileCompressor compressor{ lib, BitFormat::Zip };
        compressor.compressDirectory( "test_unicode/", u8"SottoMenù.zip" );
    } catch (const bit7z::BitException& ex) {
        std::cerr << ex.what() << std::endl;
    }
    return 0;
}

I tested extraction too, from the same archive created in the previous test:

#include <iostream>

#include <direct.h>

#include <bit7z/bit7zlibrary.hpp>
#include <bit7z/bitexception.hpp>
#include <bit7z/bitfileextractor.hpp>

using namespace bit7z;

auto main() -> int {
    std::cout << "Current Code Page: " << GetACP() << std::endl;
    _chdir(R"(C:\Users\Riccardo\Documents\bit7z-test\)");

    try {
        Bit7zLibrary lib{ R"(C:\Program Files\7-Zip\7z.dll)" };
        BitFileExtractor extractor{ lib, BitFormat::Zip };
        extractor.extract( u8"SottoMenù.zip", "test_extract/" );
    } catch (const bit7z::BitException& ex) {
        std::cerr << ex.what() << std::endl;
    }
    return 0;
}

Again, it worked without issues:

All the tests were performed using the develop branch of bit7z.

rikyoz commented 9 months ago

The only way I found to replicate your issue is by forcing the encoding of the string literal to be Windows-1252:

compressor.compressDirectory( "test_unicode/", "SottoMen\xF9.zip" );

But we already confirmed that you're using UTF-8 strings, so I'll need to investigate the issue further!

rikyoz commented 9 months ago

So, I've released a new maintenance version v4.0.2 that contains all the fixes to the UTF-8 support I made on the develop branch. I don't know if you tested all these changes, but they could fix your issue.

If the issue persists, could you provide a more detailed stack trace of where the exception happens? At least on bit7z's side. Also, are you using MSVC's /utf-8 option, or the /source-charset and /execution-charset options?

Any details you can provide will be really useful; I'm continuing to try to replicate it, but I haven't been lucky so far, so I really appreciate any help you can provide!

ra-dave commented 9 months ago

I'm still testing but it appears 4.0.2 fixed my issue (4.0.1 definitely still had the issue).

Thanks! I'll let you know if there are any issues but it's looking good.

rikyoz commented 9 months ago

I'm still testing but it appears 4.0.2 fixed my issue (4.0.1 definitely still had the issue). Thanks!

Great, you're welcome!

I'll let you know if there are any issues but it's looking good.

Thanks!