google / leveldb

LevelDB is a fast key-value storage library written at Google that provides an ordered mapping from string keys to string values.
BSD 3-Clause "New" or "Revised" License
36.39k stars 7.8k forks source link

Unicode filename handling in new windows environment #755

Open laanwj opened 4 years ago

laanwj commented 4 years ago

The new windows environment uses the A functions instead of the W ones:

  size_t error_text_size = ::FormatMessageA(
    ScopedHandle handle = ::CreateFileA(
        ::CreateFileA(filename.c_str(), desired_access, share_mode,
        ::CreateFileMappingA(handle.get(),
    ScopedHandle handle = ::CreateFileA(
    ScopedHandle handle = ::CreateFileA(
    return GetFileAttributesA(filename.c_str()) != INVALID_FILE_ATTRIBUTES;
    HANDLE dir_handle = ::FindFirstFileA(find_pattern.c_str(), &find_data);
    } while (::FindNextFileA(dir_handle, &find_data));
    if (!::DeleteFileA(filename.c_str())) {
    if (!::CreateDirectoryA(dirname.c_str(), nullptr)) {
    if (!::RemoveDirectoryA(dirname.c_str())) {
    if (!::GetFileAttributesExA(filename.c_str(), GetFileExInfoStandard,
    if (::MoveFileA(from.c_str(), to.c_str())) {
    if (::ReplaceFileA(to.c_str(), from.c_str(), /*lpBackupFileName=*/nullptr,
    ScopedHandle handle = ::CreateFileA(
    if (!GetTempPathA(ARRAYSIZE(tmp_path), tmp_path)) {

I think these use what the process code page happens to be, which might not be UTF-8. I don't claim to understand Windows unicode handling, so please tell me I'm wrong, but in our previous Windows environment we used

void ToWidePath(const std::string& value, std::wstring& target) {
    wchar_t buffer[MAX_PATH];
    MultiByteToWideChar(CP_UTF8, 0, value.c_str(), -1, buffer, MAX_PATH);
    target = buffer;
}

void ToNarrowPath(const std::wstring& value, std::string& target) {
    char buffer[MAX_PATH];
    WideCharToMultiByte(CP_UTF8, 0, value.c_str(), -1, buffer, MAX_PATH, NULL, NULL);
    target = buffer;
}

along with the *W wide variants of Windows API functions (but https://github.com/google/leveldb/issues/755#issuecomment-559721707 is a better way using std::codecvt).

I'm seeing some downstream test failures regarding file names, and I suspect this might be the cause.

sipsorcery commented 4 years ago

I've spent some time on this issue exploring whether it's possible to get this library to create a directory with unicode characters on Windows. The question boils down to whether the win32 CreateDirectoryA call can create unicode paths.

I used this simple sample to check:

#include <Shlobj.h>
#include "Windows.h"
#include <io.h>
#include <fcntl.h>

#include <codecvt>
#include <locale>
#include <iostream>
#include <string>
#include <vector>

int main()
{
  std::string aPath = u8"c:/temp/17398/CreateDirectoryA_₿_🏃_20191128_104644";
  BOOL createAResult = CreateDirectoryA(aPath.c_str(), 0);
  std::cout << "create dir result " << createAResult << std::endl;

  std::wstring wPath = L"c:/temp/17398/CreateDirectoryW_₿_🏃_20191128_104644";
  BOOL createWResult = CreateDirectoryW(wPath.c_str(), 0);
  std::cout << "create dir result " << createWResult << std::endl;
}

The results were:

u3

Maybe there's some trick to get the string parameter to CreateDirectoryA in a different encoding? I couldn't find it. I did try the new std library converter class but it didn't help.

using convert_type = std::codecvt_utf8<wchar_t>;
std::wstring_convert<convert_type, wchar_t> converter;
std::string narrowPath = converter.to_bytes(widePath);

In the end I did find a way by setting a Windows Region setting: Control Panel->Region->Administrative->Change System Locale->Beta: Use Unicode UTF-8 for worldwide language support. Checking that option (and restarting Windows) and re-running the sample resulted in:

u2

I believe that setting is only available on recent versions of Windows 10.

From reading the commit message on env_windows.cc and the associated issue it seems that CreateDirectoryA and the other win32 file system calls are using the ANSI methods in order to maintain compatibility with other OS using narrow string paths. Although I note DeleteFile can end up defined as or using DeleteFileA or DeleteFileW depending on the header includes.

I suspect the only way to robustly fix this issue is to use CreateDirectoryW and the other unicode methods instead for the Windows port instead of the ANSI ones.

laanwj commented 4 years ago

Thanks for investigating!

I believe that setting is only available on recent versions of Windows 10.

Yea this is likely the setting that can also be enabled per binary described in https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page , if so, it's only in windows versions since last year. That's not acceptable for our use.

are using the ANSI methods in order to maintain compatibility with other OS using narrow string paths.

Right, that's a fair reason, std::wstring-based APIs are unwieldy. In our case on all other OS we use UTF-8 for paths. It would be very convenient to be able to do so on Windows too.

Sure, exposing a std::wstring-based API (say, parametrize the filename type on OS) would work as well, then we'd do the UTF-8 conversion at our side, but would result in platform specific client code.

So I prefer the approach of reintroducing ToWidePath/ToNarrowPath internally in the windows_env.

I suspect the only way to robustly fix this issue is to use CreateDirectoryW and the other unicode methods instead for the Windows port instead of the ANSI ones.

I suspect so too.

sipsorcery commented 4 years ago

So I prefer the approach of reintroducing ToWidePath/ToNarrowPath internally in the windows_env.

One way to do this using the std::codecvt and that I've checked works for me is:

  std::string utf8Path = u8"c:/temp/17398/CreateDirectory_₿_🏃_20191128_104644"; // const char*, encoded as UTF-8
  std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
  std::wstring wideUtf8Path = converter.from_bytes(utf8Path);
  BOOL createWResult = CreateDirectoryW(wideUtf8Path.c_str(), 0);
  std::cout << "create dir result " << createWResult << std::endl;
laanwj commented 4 years ago

One way to do this using the std::codecvt and that I've checked works for me is:

Nice; yes, C++ idiom is better.

pwnall commented 4 years ago

If you have your own working Windows environment, there is no need to switch to the one supplied with LevelDB! The API contract is the same, and everything should still work.

I don't fully understand the issue of A vs W functions in Win32, and as far as I know there are benefits to each alternative. The Windows environment that ships with LevelDB uses *A because it was extracted from Chrome.

If the *W variant works better for you, you're welcome to fork the environment and make the changes you need! The interface between Env and the rest of LevelDB is part of the API, and is not expected to change.

laanwj commented 4 years ago

The problem is that the *A functions cannot represent the entire unicode character set.

They only take characters that happen to be part of the configured code page.

This has turned out to be a problem for Chinese users mostly, and when taking arbitrary filenames through Qt or JSON-RPC. Sometimes even the home path gives problems (Chinese users on systems with some European codepage configured).

I understand that exposing wstring on the API is not doable, but passing in utf-8 would be useful, which wouldn't require type changes

In any case, it seems we'll have to continue carrying patches for this. The hope was that it was possible to eventually switch to upstream.

sipsorcery commented 4 years ago

Chromium looks to be using the Unicode (*W) versions of the win32 API's, at least according to this include.

I'm no expert on Windows Unicode handling, *A vs *W etc. but what I have managed to decipher is that the *A versions will use the Windows system code page to interpret strings. On my machine the system code page, which I have never manually set, is CP1252.

An example of what happens in practice can be seen by passing a string of "testw🏃.txt" and "testa🏃.txt" to CreateFileW and CreateFileA respectively.

For CreateFileW file name created matches the input string. The raw bytes of the string supplied to the CreateFileW call are:

const wchar_t * {  0x74, 0x65, 0x73, 0x74, 0x77, 0xd83c, 0xdfc3, 0x2e, 0x74, 0x78, 0x74}

For CreateFileA the file name ends up as "testa🏃.txt" and is different because the CP1252 code page doesn't recognise the multibyte unicode character and treats each byte as a distinct code point. The raw bytes of the string supplied to the CreateFileA call are below. This raw bytes are obtained by calling WideCharToMultiByte with a code page of CP_UTF8 on the previous array.

const char * { 0x74, 0x65, 0x73, 0x74, 0x61, 0xf0, 0x9f, 0x8f, 0x83, 0x2e, 0x74, 0x78, 0x74}

The CreateFileA call will work correctly with the exact same input IF the system code page is set to CP_UTF8 or some other trick is used to set the application code page like this.

For leveldb (and every other library trying to deal with the Windows Unicode quagmire) the choices seem to be:

  1. Keep using the *A calls and end up with broken strings if the consumer application supplies a setting that uses characters not supported by the runtime machine's code page. Note this is now very easy to do because of the number of applications that do properly support UTF-8. For example if a config file is edited in Notepad/Notepad++ paths can be easily set with UTF-8 chars. At a guess this may resolve itself in the coming years as Microsoft push towards setting the default code page to CP_UTF8.

  2. Switch to the *W calls and fully support any path supplied by a consuming application irrespective of the runtime machine's code page. The burden is the extra bit of plumbing code to convert the utf8 narrow strings to *W compatible wide strings. For leveldb that would seem to be isolated to env_windows.cc and can be done easily with std::codecvt or MultiByteToWideChar.

farfella commented 4 years ago

Your analysis is correct. To add, using the OS provided MultiByteToWideChar and WideCharToMultiByte functions is more correct than using codecvt. codecvt_utf8/utf16/utf8_utf16 are deprecated in C++17 (https://en.cppreference.com/w/cpp/locale/codecvt), and in fact except with certain valid input.