thpatch / win32_utf8

Transparent UTF-8 support for native Win32 ANSI applications
The Unlicense

Properly supports strings with mixed encoding #11

Closed brliron closed 3 months ago

brliron commented 9 months ago

@zero318 This code tends to be called quite a lot, which is a good thing because it fixes this bug everywhere. But that also means that performance matters. You're good at optimizing code, so what do you think about this implementation? Does it need to be improved?

Amaroq-Clearwater commented 3 months ago

Oh, this is useful. Developers targeting multiple regions will love this.

brliron commented 3 months ago

Unfortunately, we discussed this on Discord (https://discord.com/channels/213769640852193282/253898950099075085/1185208172684779530), and this is too unreliable to be used in production. For example, the half-width katakana "ﾃｱ" in SHIFT-JIS is 0xC3 0xB1 in hex, or 11000011 10110001 in binary, which matches the bit pattern of a 2-byte UTF-8 sequence (one that decodes to U+00F1, "ñ"). Because of that, this feature would replace any instance of "ﾃｱ" in a SHIFT-JIS string with "ñ", mistakenly treating this short SHIFT-JIS sequence as UTF-8. I forgot to close the PR after this discussion.
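
To make the ambiguity concrete, here is a minimal Python sketch. It is not the code from this PR, and the names `AMBIGUOUS` and `looks_like_utf8_2byte` are made up for illustration. It decodes the same two bytes both ways and shows that a purely structural UTF-8 check accepts them, even though they are also a valid SHIFT-JIS encoding of two half-width katakana.

```python
# Illustration only: hypothetical names, not part of win32_utf8.

# The two bytes from the example above.
AMBIGUOUS = bytes([0xC3, 0xB1])


def looks_like_utf8_2byte(b: bytes) -> bool:
    """Return True if b has the shape of a well-formed 2-byte UTF-8 sequence."""
    return (
        len(b) == 2
        and 0xC2 <= b[0] <= 0xDF  # lead byte 110xxxxx (0xC0/0xC1 would be overlong)
        and 0x80 <= b[1] <= 0xBF  # continuation byte 10xxxxxx
    )


# Both decodings succeed, so the bytes alone cannot tell us which was intended.
as_sjis = AMBIGUOUS.decode("shift_jis")  # two half-width katakana
as_utf8 = AMBIGUOUS.decode("utf-8")      # a single Latin letter

if __name__ == "__main__":
    print("as Shift-JIS:", as_sjis)
    print("as UTF-8:    ", as_utf8)
    print("passes structural UTF-8 check:", looks_like_utf8_2byte(AMBIGUOUS))
```

Any detector that looks only at the byte pattern of a short run has the same problem: more context than the bytes themselves is needed to choose between the two readings.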

Amaroq-Clearwater commented 3 months ago

Could make it a separate build option to partly resolve that, and/or perhaps add some sort of heuristic to make it just a tad more reliable. Just a thought.

Tragic that this won’t be implemented…

Amaroq-Clearwater commented 3 months ago

@brliron Just spitballing here… For one possible heuristic solution, you could probably train a miniaturized language model to identify the locale of problematic codepoints based on that of surrounding characters, and then try to optimize the resulting map to create a faster heuristic (though it might be resource intensive). And by using blue noise (look it up) instead of white noise, you could also probably increase consistency somewhat.

However, I’m not an expert on neural networks… so I have no clue if that would actually work.

I really hope that this helps even a little…