thpatch / win32_utf8

Transparent UTF-8 support for native Win32 ANSI applications
The Unlicense

I found a UTF-8 / WCHAR converter for win32 #5

Closed by Amaroq-Clearwater 4 years ago

Amaroq-Clearwater commented 4 years ago

Here's the link: https://gist.github.com/xebecnan/6d070c93fb69f40c3673

I figure that you could give this a look so that it's also possible to read and convert between UTF-8 and WCHAR strings, and thus add WCHAR support to ANSI programs using this wrapper as well.

Thoughts?

brliron commented 4 years ago

I'm sorry, but I don't really see the point.

First, we already know that converting between UTF-8 and WCHAR is possible. In fact, we do it all the time. Almost every function we overwrite begins with WCHAR_T_CONV (and a matching WCHAR_T_DEC). That macro tries to convert from UTF-8 to UTF-16, and if that doesn't work, from "ANSI" (that is, the locale chosen for win32_utf8) to UTF-16. And when a Windows function returns a string, we convert it from UTF-16 to UTF-8.
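To illustrate the "try UTF-8 first, fall back to ANSI" idea described above, here is a minimal portable sketch. It is not the code behind WCHAR_T_CONV. The function names (`utf8_to_utf16`, `latin1_to_utf16`, `to_utf16`) are made up for this example, and Latin-1 is used as a stand-in for the "ANSI" code page, which is an assumption. On Windows itself, this kind of conversion would typically go through `MultiByteToWideChar` instead.

```c
#include <stddef.h>
#include <stdint.h>

/* Strict UTF-8 -> UTF-16 decoder (hypothetical helper).
 * Returns the number of UTF-16 units written, not counting the terminator,
 * or -1 if the input is not valid UTF-8 or the buffer is too small. */
long utf8_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    if (cap == 0) return -1;
    while (*s) {
        uint32_t cp;
        int extra;
        if (*s < 0x80)                { cp = *s;        extra = 0; }
        else if ((*s & 0xE0) == 0xC0) { cp = *s & 0x1F; extra = 1; }
        else if ((*s & 0xF0) == 0xE0) { cp = *s & 0x0F; extra = 2; }
        else if ((*s & 0xF8) == 0xF0) { cp = *s & 0x07; extra = 3; }
        else return -1;               /* stray continuation or invalid lead byte */
        s++;
        for (int i = 0; i < extra; i++, s++) {
            if ((*s & 0xC0) != 0x80) return -1;  /* also catches a premature NUL */
            cp = (cp << 6) | (*s & 0x3F);
        }
        /* Reject overlong forms, surrogates, and out-of-range values. */
        if (cp < min_cp[extra] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return -1;
        if (cp >= 0x10000) {          /* needs a surrogate pair */
            if (n + 2 >= cap) return -1;
            cp -= 0x10000;
            dst[n++] = (uint16_t)(0xD800 | (cp >> 10));
            dst[n++] = (uint16_t)(0xDC00 | (cp & 0x3FF));
        } else {
            if (n + 1 >= cap) return -1;
            dst[n++] = (uint16_t)cp;
        }
    }
    dst[n] = 0;
    return (long)n;
}

/* Stand-in for the "ANSI" fallback: Latin-1 maps each byte to one UTF-16 unit.
 * A real ANSI code page would need a lookup table instead. */
long latin1_to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    const unsigned char *s = (const unsigned char *)src;
    size_t n = 0;

    if (cap == 0) return -1;
    for (; *s; s++) {
        if (n + 1 >= cap) return -1;
        dst[n++] = *s;
    }
    dst[n] = 0;
    return (long)n;
}

/* Try UTF-8 first; if the bytes are not valid UTF-8, treat them as "ANSI". */
long to_utf16(const char *src, uint16_t *dst, size_t cap)
{
    long n = utf8_to_utf16(src, dst, cap);
    if (n < 0) n = latin1_to_utf16(src, dst, cap);
    return n;
}
```

With this sketch, `to_utf16("caf\xC3\xA9", buf, 8)` takes the UTF-8 path, while `to_utf16("caf\xE9", buf, 8)` fails UTF-8 validation and goes through the fallback. Both end up with the same four UTF-16 units.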

Then, in the most common use case, we have a program assuming ANSI strings, and the end goal is to make that program use Unicode strings without changing it. The program is built with ANSI in mind; it doesn't care what kind of Unicode it gets as long as it looks like ANSI.

And the biggest problem: UTF-16 doesn't look like ANSI at all. UTF-8 strings have the nice property of never containing a NUL byte except as the terminator, just like regular C strings. A side effect is that most functions working on ANSI strings will also work on UTF-8 strings. For example, the strcpy implementation is exactly the same for both: you just copy everything up to the NUL byte from the source to the destination. (By the way, UTF-8 was actually designed with this in mind.)

But UTF-16 wasn't designed for that. Every character is at least 2 bytes, and for every ASCII character, one of those bytes is NUL. An ANSI or UTF-8 implementation of strcpy, when given a UTF-16 string, will copy the string up to the 1st NUL byte. If you try to copy a UTF-16 string with only ASCII characters... well, you will only copy the 1st character, because the ANSI implementation will see a byte, then a NUL byte, and will stop there. It won't even add a proper 2-byte UTF-16 NUL character at the end.

And in my experience, compiled code has a lot of inlined strlen and strcpy calls. Because they only care about a byte being NUL or not NUL, the ANSI implementations work well with UTF-8 strings, but they would break or crash on UTF-16 strings.
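A small sketch of that behaviour, using the standard `strlen` and `strcpy` on hand-written byte arrays. The array and helper names (`utf8_hi`, `utf16le_hi`, `byte_len`, `byte_copy`) are invented for this example, and little-endian UTF-16 is assumed, as on Windows.

```c
#include <stddef.h>
#include <string.h>

/* "Hi" as UTF-8: identical to ASCII, a single NUL byte at the end. */
const char utf8_hi[] = { 'H', 'i', 0 };

/* "caf\u00E9" as UTF-8: the non-ASCII character takes two bytes (0xC3 0xA9),
 * neither of which is NUL. */
const char utf8_cafe[] = "caf\xC3\xA9";

/* "Hi" as UTF-16LE: each ASCII character is followed by a 0x00 byte,
 * and the terminator is two 0x00 bytes. */
const char utf16le_hi[] = { 'H', 0, 'i', 0, 0, 0 };

/* Length as seen by byte-oriented (ANSI) code. */
size_t byte_len(const char *s)
{
    return strlen(s);
}

/* Copy with plain strcpy; returns the number of bytes written,
 * including the single terminating NUL byte. */
size_t byte_copy(char *dst, const char *src)
{
    strcpy(dst, src);
    return strlen(dst) + 1;
}
```

Here `byte_len(utf8_hi)` is 2 and `byte_len(utf8_cafe)` is 5, so byte-oriented code sees the whole UTF-8 string. `byte_len(utf16le_hi)` is 1, and `byte_copy` on it writes only `'H'` plus one NUL byte, so the rest of the string is silently dropped.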

Amaroq-Clearwater commented 4 years ago

Ah, I understand. My apologies.