Open laanwj opened 4 years ago
I've spent some time on this issue exploring whether it's possible to get this library to create a directory with unicode characters on Windows. The question boils down to whether the win32 CreateDirectoryA
call can create unicode paths.
I used this simple sample to check:
#include <Shlobj.h>
#include "Windows.h"
#include <io.h>
#include <fcntl.h>
#include <codecvt>
#include <locale>
#include <iostream>
#include <string>
#include <vector>
int main()
{
std::string aPath = u8"c:/temp/17398/CreateDirectoryA_₿_🏃_20191128_104644";
BOOL createAResult = CreateDirectoryA(aPath.c_str(), 0);
std::cout << "create dir result " << createAResult << std::endl;
std::wstring wPath = L"c:/temp/17398/CreateDirectoryW_₿_🏃_20191128_104644";
BOOL createWResult = CreateDirectoryW(wPath.c_str(), 0);
std::cout << "create dir result " << createWResult << std::endl;
}
The results were:
Maybe there's some trick to get the string parameter to CreateDirectoryA
in a different encoding? I couldn't find it. I did try the new std library converter class but it didn't help.
using convert_type = std::codecvt_utf8<wchar_t>;
std::wstring_convert<convert_type, wchar_t> converter;
std::string narrowPath = converter.to_bytes(widePath);
In the end I did find a way by setting a Windows Region setting: Control Panel->Region->Administrative->Change System Locale->Beta: Use Unicode UTF-8 for worldwide language support. Checking that option (and restarting Windows) and re-running the sample resulted in:
I believe that setting is only available on recent versions of Windows 10.
From reading the commit message on env_windows.cc
and the associated issue it seems that CreateDirectoryA
and the other win32 file system calls are using the ANSI methods in order to maintain compatibility with other OS using narrow string paths. Although I note DeleteFile
can end up defined as or using DeleteFileA
or DeleteFileW
depending on the header includes.
I suspect the only way to robustly fix this issue is to use CreateDirectoryW
and the other unicode methods instead for the Windows port instead of the ANSI ones.
Thanks for investigating!
I believe that setting is only available on recent versions of Windows 10.
Yea this is likely the setting that can also be enabled per binary described in https://docs.microsoft.com/en-us/windows/uwp/design/globalizing/use-utf8-code-page , if so, it's only in windows versions since last year. That's not acceptable for our use.
are using the ANSI methods in order to maintain compatibility with other OS using narrow string paths.
Right, that's a fair reason, std::wstring-based APIs are unwieldy. In our case on all other OS we use UTF-8 for paths. It would be very convenient to be able to do so on Windows too.
Sure, exposing a std::wstring-based API (say, parametrize the filename type on OS) would work as well, then we'd do the UTF-8 conversion at our side, but would result in platform specific client code.
So I prefer the approach of reintroducing ToWidePath
/ToNarrowPath
internally in the windows_env
.
I suspect the only way to robustly fix this issue is to use CreateDirectoryW and the other unicode methods instead for the Windows port instead of the ANSI ones.
I suspect so too.
So I prefer the approach of reintroducing ToWidePath/ToNarrowPath internally in the windows_env.
One way to do this using the std::codecvt and that I've checked works for me is:
std::string utf8Path = u8"c:/temp/17398/CreateDirectory_₿_🏃_20191128_104644"; // const char*, encoded as UTF-8
std::wstring_convert<std::codecvt_utf8_utf16<wchar_t>> converter;
std::wstring wideUtf8Path = converter.from_bytes(utf8Path);
BOOL createWResult = CreateDirectoryW(wideUtf8Path.c_str(), 0);
std::cout << "create dir result " << createWResult << std::endl;
One way to do this using the std::codecvt and that I've checked works for me is:
Nice; yes, C++ idiom is better.
If you have your own working Windows environment, there is no need to switch to the one supplied with LevelDB! The API contract is the same, and everything should still work.
I don't fully understand the issue of A vs W functions in Win32, and as far as I know there are benefits to each alternative. The Windows environment that ships with LevelDB uses *A because it was extracted from Chrome.
If the *W variant works better for you, you're welcome to fork the environment and make the changes you need! The interface between Env and the rest of LevelDB is part of the API, and is not expected to change.
The problem is that the *A functions cannot represent the entire unicode character set.
They only take characters that happen to be part of the configured code page.
This has turned out to be a problem for Chinese users mostly, and when taking arbitrary filenames through Qt or JSON-RPC. Sometimes even the home path gives problems (Chinese users on systems with some European codepage configured).
I understand that exposing wstring on the API is not doable, but passing in utf-8 would be useful, which wouldn't require type changes
In any case, it seems we'll have to continue carrying patches for this. The hope was that it was possible to eventually switch to upstream.
Chromium looks to be using the Unicode (*W) versions of the win32 API's, at least according to this include.
I'm no expert on Windows Unicode handling, *A
vs *W
etc. but what I have managed to decipher is that the *A
versions will use the Windows system code page to interpret strings. On my machine the system code page, which I have never manually set, is CP1252.
An example of what happens in practice can be seen by passing a string of "testw🏃.txt" and "testa🏃.txt" to CreateFileW
and CreateFileA
respectively.
For CreateFileW
file name created matches the input string. The raw bytes of the string supplied to the CreateFileW
call are:
const wchar_t * { 0x74, 0x65, 0x73, 0x74, 0x77, 0xd83c, 0xdfc3, 0x2e, 0x74, 0x78, 0x74}
For CreateFileA
the file name ends up as "testaðŸƒ.txt" and is different because the CP1252 code page doesn't recognise the multibyte unicode character and treats each byte as a distinct code point. The raw bytes of the string supplied to the CreateFileA
call are below. This raw bytes are obtained by calling WideCharToMultiByte
with a code page of CP_UTF8
on the previous array.
const char * { 0x74, 0x65, 0x73, 0x74, 0x61, 0xf0, 0x9f, 0x8f, 0x83, 0x2e, 0x74, 0x78, 0x74}
The CreateFileA
call will work correctly with the exact same input IF the system code page is set to CP_UTF8
or some other trick is used to set the application code page like this.
For leveldb (and every other library trying to deal with the Windows Unicode quagmire) the choices seem to be:
Keep using the *A
calls and end up with broken strings if the consumer application supplies a setting that uses characters not supported by the runtime machine's code page. Note this is now very easy to do because of the number of applications that do properly support UTF-8. For example if a config file is edited in Notepad/Notepad++ paths can be easily set with UTF-8 chars. At a guess this may resolve itself in the coming years as Microsoft push towards setting the default code page to CP_UTF8
.
Switch to the *W
calls and fully support any path supplied by a consuming application irrespective of the runtime machine's code page. The burden is the extra bit of plumbing code to convert the utf8 narrow strings to *W
compatible wide strings. For leveldb that would seem to be isolated to env_windows.cc
and can be done easily with std::codecvt
or MultiByteToWideChar
.
Your analysis is correct. To add, using the OS provided MultiByteToWideChar
and WideCharToMultiByte
functions is more correct than using codecvt
. codecvt_utf8/utf16/utf8_utf16
are deprecated in C++17 (https://en.cppreference.com/w/cpp/locale/codecvt), and in fact except with certain valid input.
The new windows environment uses the A functions instead of the W ones:
I think these use what the process code page happens to be, which might not be UTF-8. I don't claim to understand Windows unicode handling, so please tell me I'm wrong, but in our previous Windows environment we used
along with the *W wide variants of Windows API functions (but https://github.com/google/leveldb/issues/755#issuecomment-559721707 is a better way using std::codecvt).
I'm seeing some downstream test failures regarding file names, and I suspect this might be the cause.