rmyorston / busybox-w32

WIN32 native port of BusyBox.
https://frippery.org/busybox
Other
677 stars 124 forks source link

Win32: make unicode print correctly regardless of console CP #349

Closed avih closed 1 year ago

avih commented 1 year ago

The first commit is a minor refactor, to make it simpler for the later commits.

The second commit makes the unicode build output unicode correctly by default regardless of the console CP, e.g. at the shell prompt or the output of ls.

The third commit allows disabling this output conversion with reasonably standard means, e.g. using LC_ALL=C for both the non-unicode and the unicode build (the unicode build only had this issue since the second commit above, because till now it never converted and instead hoped the console CP is UTF8).

There's still an unrelated unicode issue which I need to explore, where typing unicode input while caps-lock is on seem to generate both the caps ASCII letters, and the unicode input (tested only with windows terminal 1.18 so far).

So I might add a fix to this PR later, if it's small enough, once I fix it.

avih commented 1 year ago

Updated the 2nd and 3rd commits - with FEATURE_UTF8OUTPUT the console CP test now happens at `conv[f]writeConrather than atwriteCon_utf8, which allows re-using the existig code path to print without conversion, instead of bypassing the conversion insidewriteCon_utf8`.

Also, removed a compiler prototype warning by using void argument at static int conout_conv_enabled(void) { ... }.

avih commented 1 year ago

There's still an unrelated unicode issue which I need to explore, where typing unicode input while caps-lock is on seem to generate both the caps ASCII letters, and the unicode input (tested only with windows terminal 1.18 so far).

So, as far as I can tell, this is a windows terminal 1.18 issue. The windows console and windows terminal 1.17 don't have this issue.

In terminal 1.18, when in English input mode and caps-lock is on, and pressing e.g. a, we get a key-down event of A (as expected) and key-up event of a (bug). This works correctly despite the issue because busybox only looks at key-down.

When in non-English input mode and caps-lock is on, and pressing a, we get a key-down event of A (seems correct), and key-up event of the non-english codepoint (when using ReadConsoleInputW) - that's a bug as far as I can tell.

In this case, the workaround of "missing key-down event" kicks in for the unicode codepoint up event, and busybox-w32 interprets both the A keydown, and also readConsoleInput_utf8 generates a faux down event of the non-English codepoint.

The workaround behaves as expected, as it's not designed to deal with A down and some unicode codepoint up. I don't think we should enhance it to work around this issue as well.

The workaround is not applied in English input mode (A down, a up) because it intentionally doesn't mess with ASCII7 values, and this is another confirmation that it was a correct decision.

If using ReadConsoleInputA then we get down event of Aand few up events of the UTF8 values of the non-English cdepoint.

With the windows console or windows terminal 1.17 and caps-lock is on we always get the caps-letter A down and up events regardless if the input mode is English or non-English, which seems to be the correct behavior.

So for now I consider this a windows terminal 1.18 (development build) issue.

As such, with this PR, I think the unicode build is ready for public consumption.

avih commented 1 year ago

Also, note that with this PR, charToConA is no longer used, because the unified functions use only charToConBuffA.

charToConBuffA is now used only in one place outside the unified functions - when setting the console title (I didn't try it, but I presume it's indeed required).

It's probably OK to keep charToConA around even unused, for completeness, and the linker should ignore it anyway, so it shouldn't have an impact on the binary size, presumably.

But if you want it removed, let me know.

avih commented 1 year ago

As such, with this PR, I think the unicode build is ready for public consumption.

I still think you should consider adding something like win10+ to the binary name, because otherwise, once (if) we have native unicode, and it has the same name like before, people would not realize it now works also on earlier versions of windows - because it has the same name and they already know the u version doesn't work on win7...

But if a win10+u name becomes u then it's easier to tell the dependency on win10 has changed.

rmyorston commented 1 year ago

Look good to me. Thanks.

avih commented 1 year ago

So, while writeCon_utf8 is not too slow, it's still slower compared to the direct fwrite case (e.g. when the console codepage is UTF8).

With the windows console, it's about 1.4x slower.

Tested with this file duplicated to about 5M, and then cat this file is about 3.4s with console codepage 437 (where witeCon_utf8 is used), compared to 2.4s with cp UTF8 (uses fwrite).

However, that's masked by the speed (slowness) at which the conhost processes the input.

Apparently, with newer conhost/conpty (e.g. when using a recent windows terminal), it's able to process the input a lot quicker - about 4x quicker on my system compared to the builtin windows console (win10).

So the results with the same test as above now become about 2.5x slower (1.6s vs 0.6s). Still not terrible, but also not fun.

Luckily, it turns out that the main limit is the 256 (wide) chars buffer which writeCon_utf8 uses.

Increasing it to 4096 brings writeCon_utf8 case down to ~ 0.65s, which is within ~ 10% of the native fprintf with new conhost, and with negligible difference from fprintf with the builtin console. Making it 8192 still improves slightly, but these are diminishing returns, so 4096 is probably fine.

So I'd like to increase it to 4096, but currently it's on stack, and 4096 wide chars is 8K, which is not so fun on the stack, even if unlikely to become an issue.

So I'll probably make it static, because it's single threaded anyway.

Are you OK with making the buffer static with 4096 or 8102 wide chars? Any preferences or other comments?

rmyorston commented 1 year ago

I'd be inclined to make wbuf a static pointer and allocate the buffer on first use. That way it's only necessary to pay the cost of the buffer if it's actually used.

avih commented 1 year ago

Thanks, I'll do that.

On a related note, it turns out it's possible to use a new conhost without the windows terminal.

Apparently OpenConsole.exe which ships with the windows terminal is conhost, and it's stand alone. It's also available at the CI artifacts of every windows termimal commit (as it's developed at the same repo), so one could grab the very latest at any point in time.

One could pluck it out of the windows terminal dir and place it anywhere, then double clicking it opens a (new) windows console with cmd.exe. It also supports arguments to execute instead of cmd.exe, e.g. OpenConsole c:\path\to\busybox.exe sh -l (the same is also possible with the builtin conhost.exe).

It has benefits over the native conhost, such as much better rendering and fonts fallback support (though it seems to render emojis always in black and while, to whom that matters), greater speed, and I'm guessing general improvements of the VT behavior.

It's also "purer" than the windows terminal, way tinier, and might not have bugs which the windows terminal has, such as the bad keyup event with caps lock which I mentioned above - OpenConsole.exe does not have this issue, while the windows terminal which ships this openconsole does have the issue.

So just an FYI.

avih commented 1 year ago

OpenConsole.exe ... also supports arguments to execute instead of cmd.exe, e.g. OpenConsole c:\path\to\busybox.exe sh -l

Unofficially, but still...