moonshadow565 / ritobin

MIT License

Ensure paths are always converting utf8 <-> utf16 correctly #4

Closed Morilli closed 1 year ago

Morilli commented 1 year ago

This was previously problematic on Windows, where native paths use UTF-16 encoding but file paths may be provided in a different encoding, such as UTF-8.
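As an illustration of what an explicit UTF-8 to native conversion can look like, here is a minimal sketch using only standard library calls. The helper name path_from_utf8 is hypothetical and is not taken from this pull request. It only gives correct results if the incoming bytes really are UTF-8.

```cpp
#include <filesystem>
#include <string>
#include <string_view>

namespace fs = std::filesystem;

// Hypothetical helper (not from the PR): build a path from bytes that the
// caller asserts are UTF-8, independent of the active code page.
inline fs::path path_from_utf8(std::string_view utf8) {
#if defined(__cpp_lib_char8_t)
    // C++20: a path constructed from char8_t data is defined to be UTF-8.
    return fs::path(std::u8string(utf8.begin(), utf8.end()));
#else
    // C++17: u8path does the same job (deprecated in C++20).
    return fs::u8path(utf8.begin(), utf8.end());
#endif
}
```

On Windows the resulting path holds UTF-16 internally. On platforms with narrow native paths the bytes are typically stored unchanged.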

moonshadow565 commented 1 year ago

This won't fix the issue, because the source of the path is a char const*, which on Windows is encoded in the local code page by default. The correct way to fix this would be to either:

  1. Use wmain and wchar_t, which fs::path has an overload to construct from. This approach won't work because the argument parsing library (like everything else in the normal C/C++ ecosystem) operates on narrow char.
  2. Set the code page to UTF-8. This approach will only work on Windows 10+, but that should be fine; anyone on anything older than Windows 10 can fix it themselves if they need it.

To set the correct code page, a manifest file can be provided to the base library here: https://github.com/moonshadow565/ritobin/blob/master/ritobin_lib/CMakeLists.txt#L8

You can see an example of how this is done in: https://github.com/LeagueToolkit/cslol-manager/blob/master/cslol-tools/CMakeLists.txt#L65
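To check at runtime whether the manifest actually took effect, something like the following sketch could be used. The function name active_code_page_is_utf8 is made up for illustration. GetACP and CP_UTF8 are the real Win32 names. On non-Windows builds the function simply assumes UTF-8, which is an assumption and not a guarantee.

```cpp
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: reports whether narrow (char) strings handed to
// ANSI Win32 APIs are interpreted as UTF-8 in this process.
inline bool active_code_page_is_utf8() {
#ifdef _WIN32
    // CP_UTF8 is 65001; GetACP() returns it when the UTF-8 manifest
    // setting (or the system-wide UTF-8 option) is active.
    return GetACP() == CP_UTF8;
#else
    // Assumption: other platforms are treated as UTF-8 here.
    return true;
#endif
}
```

A tool could call this once at startup and print a warning if it returns false.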

Morilli commented 1 year ago

Okay, I have no idea why, but somehow just setting the code page to UTF-8 in the manifest works...

moonshadow565 commented 1 year ago

Might as well pull in long path manifest?

moonshadow565 commented 1 year ago

The reason it works is that std::filesystem::path on Windows (MSVC) stores paths internally as wchar_t (i.e. UTF-16). It uses the Win32 API to convert from char to wchar_t, and once the code page is set to UTF-8 the char input is naturally interpreted as UTF-8.

Morilli commented 1 year ago

> Might as well pull in long path manifest?

Can do that I guess.

> It uses win32 api to convert from char to wchar_t which when set to utf8 codepage will naturally be encoded in utf8.

But that doesn't explain why that same mechanism produces garbage when the default codepage is not utf8 (shouldn't the commandline arg use the same codepage as the conversion function?).

moonshadow565 commented 1 year ago

> But that doesn't explain why that same mechanism produces garbage when the default codepage is not utf8 (shouldn't the commandline arg use the same codepage as the conversion function?).

There can be multiple reasons. If the code page is mismatched between a program and the program calling it, non-ASCII characters will be converted to the wrong characters when round-tripping through wchar_t. If you are using an MSYS-built shell, it is likely trying to pass arguments as UTF-8.
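The mismatch can be reproduced without any Windows API. The sketch below decodes the same bytes to UTF-16 twice: once as UTF-8 and once as Latin-1, which stands in for a legacy single-byte code page (real Windows code pages such as 1252 differ in some positions). Both function names are hypothetical, and the UTF-8 decoder is deliberately minimal: it does not validate continuation bytes or reject overlong forms.

```cpp
#include <cstddef>
#include <string>
#include <string_view>

// Minimal UTF-8 -> UTF-16 decoder (illustration only, little validation).
inline std::u16string decode_utf8(std::string_view in) {
    std::u16string out;
    for (std::size_t i = 0; i < in.size();) {
        const auto b0 = static_cast<unsigned char>(in[i]);
        // Sequence length is determined by the lead byte.
        const std::size_t len = b0 < 0x80         ? 1
                              : (b0 >> 5) == 0x06 ? 2
                              : (b0 >> 4) == 0x0E ? 3
                              : (b0 >> 3) == 0x1E ? 4
                                                  : 0;
        if (len == 0 || i + len > in.size()) {
            out.push_back(u'\uFFFD');  // replacement character
            ++i;
            continue;
        }
        // Keep only the payload bits of the lead byte.
        char32_t cp = len == 1 ? b0 : (b0 & (0xFF >> (len + 1)));
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(in[i + k]) & 0x3F);
        i += len;
        if (cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            // Encode as a surrogate pair.
            cp -= 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
            out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
        }
    }
    return out;
}

// Latin-1 -> UTF-16: every byte maps to the code point of the same value.
inline std::u16string decode_latin1(std::string_view in) {
    std::u16string out;
    for (char c : in)
        out.push_back(static_cast<char16_t>(static_cast<unsigned char>(c)));
    return out;
}
```

Feeding the UTF-8 bytes for "é" (0xC3 0xA9) to decode_utf8 gives the single code unit U+00E9, while decode_latin1 gives the two code units U+00C3 U+00A9 ("Ã©"), which is the kind of garbage described above.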