rmyorston / busybox-w32

WIN32 native port of BusyBox.
https://frippery.org/busybox

Unable to input CJK for newer versions. #335

Open infoagee opened 1 year ago

infoagee commented 1 year ago

Hi, FRP-1035(1.27) is able to input CJK in cmd.exe of win7, but FRP-3128(1.31) and newer versions are not. So, what changed?

rmyorston commented 1 year ago

How can I test this?

My first thought is that it might be related to support for the euro symbol which was introduced in the FRP-2962 release. That changed how input is handled.

Could you try this FRP-2910 binary which precedes that change?

infoagee commented 1 year ago

CJK input works with FRP-2910. Is there an FRP-2962 binary or source code? Testing this probably requires an input method that can enter non-ASCII characters.

avih commented 1 year ago

FRP-1035(1.27) is able to input CJK in cmd.exe of win7

Just to be sure, do you mean by typing it at the cmd.exe prompt? e.g.

c:\users\...> busybox.exe ls -l <CJK-file-name>

Or at the busybox shell prompt (which is indeed affected by the euro symbol support)?

Could you please describe how well it works? for instance does typing or pasting this at the busybox shell prompt (replace the <...> with an actual file name with CJK):

c:/users/... $ ls -l "<some unicode/CJK file name>"

For what it's worth, I'm working on improvements to allow arbitrary unicode input with busybox-w32. It's not quite ready yet but it does seem promising. However, it does require Windows 10.

infoagee commented 1 year ago

My first thought is that it might be related to support for the euro symbol which was introduced in the FRP-2962 release. That changed how input is handled.

Seems true, FRP-3128 is ok now when compiled with "CONFIG_FEATURE_EURO=n". Thanks a lot.

infoagee commented 1 year ago

Just to be sure, do you mean by typing it at the cmd.exe prompt? e.g.

The procedure is simple, the output is something like the following:

C:\Users\xxx>
C:\Users\xxx>busybox.exe sh -l
~ $
~ $ echo 的    # here input a CJK char, some busybox versions work, some versions not.

rmyorston commented 1 year ago

Seems true, FRP-3128 is ok now when compiled with "CONFIG_FEATURE_EURO=n".

I'll have another look at euro support.

If only Microsoft hadn't so thoroughly botched it.

avih commented 1 year ago

If only Microsoft hadn't so thoroughly botched it.

I think I'm observing similar issues and inconsistencies in console input while working on generic unicode input support - it largely does similar kinds of things as the euro feature, but with different details and in a bit more generic way.

ale5000-git commented 1 year ago

@rmyorston The character è, which can usually be typed and displayed without UTF-8, cannot be typed when chcp is 65001. I'm not sure if it is related, but maybe some code is not working properly in some cases.

rmyorston commented 1 year ago

I'll have another look at euro support.

One thing I noticed is that the changes in windows_read_key() don't seem particularly useful. They allow the euro character to be entered even if the code page doesn't support it. Someone who needs euro support would probably prefer to have the symbol displayed properly too.

I've made this a separate configuration option, disabled in the default build.

There's a PRE-5067 prerelease with this change.

infoagee commented 1 year ago

Could you please describe how well it works? for instance does typing or pasting this at the busybox shell prompt (replace the <...> with an actual file name with CJK):

c:/users/... $ ls -l "<some unicode/CJK file name>"
  • Does it show correctly at the shell prompt before pressing enter?
  • Can it be edited normally at the shell prompt before pressing enter? (left/right arrow keys, delete/backspace, word jump, etc)
  • Does it list the correct file (assuming it exists) after pressing enter?

With "CONFIG_FEATURE_EURO=y", nothing showed when input CJK chars, or paste CJK chars.

So I am sorry, I am unable to run these tests.

avih commented 1 year ago

With "CONFIG_FEATURE_EURO=y", nothing showed when input CJK chars

I was asking specifically about the version where you said it did work... this:

FRP-1035(1.27) is able to input CJK in cmd.exe of win7

Or any newer version which you compiled without FEATURE_EURO and also said it works, like this:

Seems true, FRP-3128 is ok now when compiled with "CONFIG_FEATURE_EURO=n".

And it might also work in the newest prerelease from yesterday, where the Euro feature was split into two different things and one of them is now disabled by default, here https://frippery.org/files/busybox/prerelease/?C=M;O=D

infoagee commented 1 year ago

I was asking specifically about the version where you said it did work... this:

Oh, FRP-1035 has always worked well on all three of your questions. One thing worth mentioning: one needs to press an arrow key twice to move across one CJK char. I think this is OK, and I don't mind it.
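
A possible explanation for needing two arrow presses, stated as an assumption rather than something confirmed in this thread: if the line editor moves the cursor one byte at a time, a double-byte cp936 character takes two steps. The sketch below compares byte-wise and character-aware stepping. `dbcs_char_len()` and `presses_charwise()` are hypothetical helpers, not busybox-w32 functions.

```c
#include <stddef.h>

/* Length in bytes of the character starting at s[0] in a double-byte code
 * page such as cp936 (GBK): lead bytes are 0x81..0xFE and are followed by
 * one trail byte. Simplified: the trail byte itself is not validated. */
static size_t dbcs_char_len(const unsigned char *s, size_t remaining)
{
    if (remaining >= 2 && s[0] >= 0x81 && s[0] <= 0xFE)
        return 2;
    return remaining ? 1 : 0;
}

/* Number of right-arrow presses a character-aware editor would need to
 * cross the whole buffer. A byte-oriented editor needs len presses. */
static size_t presses_charwise(const unsigned char *s, size_t len)
{
    size_t n = 0;
    for (size_t i = 0; i < len; i += dbcs_char_len(s + i, len - i))
        n++;
    return n;
}
```

For the four bytes of "a的b" in cp936 (0x61 0xB5 0xC4 0x62), a byte-oriented editor needs four presses while `presses_charwise()` reports three.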

avih commented 1 year ago

Thanks.

Did you also try the latest prerelease from yesterday? Does it work there too?

infoagee commented 1 year ago

PRE-5067 does not work properly for me. My cmd.exe default code page is cp936, i.e., Chinese GBK. PRE-5067 seems to change ONE CJK char into TWO ? (question marks):

C:\Users\xxx>e:\Download\busybox-w32-PRE-5067-g597d31eef.exe sh -l
~ $ a='??'    # here what I typed is a='的', but shown as ??
~ $ echo -n "$a" | od -tx1 -An    # this shows they are question marks.
 3f 3f
~ $
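
One possible reading of the `3f 3f` output, offered as an assumption and not as a diagnosis of PRE-5067: in cp936 the character 的 is the two-byte sequence 0xB5 0xC4, so any step that handles input one byte at a time and substitutes '?' for what it cannot map would emit one '?' per byte. The helpers `lossy_single_byte()` and `convert_bytewise()` are hypothetical and are not taken from busybox-w32.

```c
#include <stddef.h>

/* Hypothetical single-byte converter: passes ASCII through and replaces
 * every other byte with '?', the way a lossy code page conversion might
 * when it cannot map a byte. */
static unsigned char lossy_single_byte(unsigned char c)
{
    return c < 0x80 ? c : (unsigned char)'?';
}

/* Convert a buffer one byte at a time. A double-byte character such as
 * cp936 "的" (0xB5 0xC4) is seen as two unrelated bytes, so it comes out
 * as two question marks (0x3f 0x3f). */
static void convert_bytewise(const unsigned char *in, size_t len,
                             unsigned char *out)
{
    for (size_t i = 0; i < len; i++)
        out[i] = lossy_single_byte(in[i]);
}
```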

rmyorston commented 1 year ago

The workaround to support the euro is only intended to work in CP 858. I've skipped the workaround in all other cases.

Please try the PRE-5068 prerelease.

infoagee commented 1 year ago

Oh, 5068 got stuck after I typed a CJK char:

~ $ a=' #stuck here, after I typed or pasted a CJK char

There was no response to "exit", Ctrl-C, Ctrl-D or Ctrl-Z. CPU usage jumped to ~30% of one hardware thread, i.e. ~7% in total on my machine (2 cores × 2 hyper-threads per core). The temperature rose quickly. I had to close the cmd window, which did close.

rmyorston commented 1 year ago

5068 got stuck

OK, I see the problem. Hold on...

rmyorston commented 1 year ago

Sorry about that. There's now a PRE-5069 which shouldn't get stuck!

infoagee commented 1 year ago

Never mind, I am not in a hurry and I can use other versions as a workaround. 5069 seems to work well in my several simple tests.

avih commented 1 year ago

@infoagee

Would you mind compiling this program, run it, and report what it prints?

#include <stdio.h>
#include <windows.h>

/* Print the ANSI, OEM, console input and console output code pages. */
int main(void) {
    printf("ACP:%u  OEMCP:%u  Con(In)CP:%u  ConOutCP:%u\n",
           GetACP(), GetOEMCP(), GetConsoleCP(), GetConsoleOutputCP());
    return 0;
}

For convenience, I'm attaching a zip with the source code and pre-compiled Windows executable if you prefer (32 bit, compiled with TCC).

codepages.zip

For what it's worth, on my system the output by default is:

ACP:1252  OEMCP:437  Con(In)CP:437  ConOutCP:437

And if a UTF-8 manifest is attached, the output becomes:

ACP:65001  OEMCP:65001  Con(In)CP:437  ConOutCP:437

@rmyorston I'm suspecting that OemToChar and CharToOem (and the buff variants) should really have been "console input CP to ansi" and "ansi to console output CP" at the busybox-w32 source code, respectively, but they happen to usually work because typically the console in/out CP are the same as the OEM CP?

Except when the user changes it with chcp...

So, at the source code, if one wanted to avoid any conversion altogether and the pitfalls that come with it, one could simply set the console in/out cp to ACP... and bingo...

That's pretty much what my generic unicode input patch does: set the console in/out CP to ACP if ACP is UTF8, and then avoid the oem/char conversion if both CPs are the same (because apparently trying to convert a single byte of a multi-byte UTF8 sequence fails with OemToChar/CharToOem even if both ANSI and OEM are UTF8).
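
The rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the actual patch: `CP_UTF8_ID`, `want_console_cp()` and `need_oem_conversion()` are hypothetical names, and real code would obtain the values from `GetACP()`, `GetConsoleCP()` and `GetConsoleOutputCP()` and apply the result with `SetConsoleCP()` and `SetConsoleOutputCP()`.

```c
#include <stdbool.h>

#define CP_UTF8_ID 65001u  /* same value as the Windows CP_UTF8 constant */

/* Which code page the console should use: switch to the ACP only when the
 * ACP is UTF-8 (for instance via a UTF-8 manifest), otherwise leave the
 * console code page as it was found. */
static unsigned want_console_cp(unsigned acp, unsigned console_cp)
{
    return acp == CP_UTF8_ID ? acp : console_cp;
}

/* The OEM/ANSI conversion is only needed when the two code pages differ. */
static bool need_oem_conversion(unsigned acp, unsigned console_cp)
{
    return acp != console_cp;
}
```

With the manifest-mode values posted above (ACP 65001, console 437), `want_console_cp()` picks 65001, after which `need_oem_conversion()` is false. With the default values (ACP 1252, console 437) the console is left alone and the conversion stays in place.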

Except for some unrelated windows issues which can be handled too, that's pretty much enough to get 100% working unicode input and output.

rmyorston commented 1 year ago

I'm suspecting that OemToChar and CharToOem (and the buff variants) should really have been "console input CP to ansi" and "ansi to console output CP", respectively, but they happen to usually work because typically the console in/out CP are the same as the OEM CP?

Yes, that had occurred to me too.

one could simply set the console in/out cp to ACP... and bingo...

That requires some thought...

avih commented 1 year ago

That requires some thought...

The only downside I can think of is that external ANSI programs might behave unexpectedly when invoked by busybox-w32, so if that turns out to be a real issue, the original values (which were recorded on entry) can be restored when invoking an external program.
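
A minimal sketch of the record-and-restore idea, assuming the code pages are captured once before they are changed. `struct saved_cp`, `console_cp_enter()` and `console_cp_restore()` are hypothetical names. The four console functions are the real Win32 calls when built on Windows; the stand-ins in the `#else` branch exist only so the sketch compiles elsewhere.

```c
#ifdef _WIN32
#include <windows.h>
#else
/* Stand-ins so the sketch also builds off Windows; they only record values. */
static unsigned fake_in_cp = 437, fake_out_cp = 437;
static unsigned GetConsoleCP(void) { return fake_in_cp; }
static unsigned GetConsoleOutputCP(void) { return fake_out_cp; }
static int SetConsoleCP(unsigned cp) { fake_in_cp = cp; return 1; }
static int SetConsoleOutputCP(unsigned cp) { fake_out_cp = cp; return 1; }
#endif

/* Console code pages as they were found on entry. */
struct saved_cp { unsigned in, out; };

/* Record the current console code pages, then switch both to new_cp. */
static struct saved_cp console_cp_enter(unsigned new_cp)
{
    struct saved_cp s = { GetConsoleCP(), GetConsoleOutputCP() };
    SetConsoleCP(new_cp);
    SetConsoleOutputCP(new_cp);
    return s;
}

/* Put the recorded values back, for example just before spawning an
 * external program or on exit. */
static void console_cp_restore(struct saved_cp s)
{
    SetConsoleCP(s.in);
    SetConsoleOutputCP(s.out);
}
```

Error handling is omitted: the Win32 setters can fail (for example when no console is attached), and a real implementation would need to check their return values.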

rmyorston commented 1 year ago

A busybox-w32 program doesn't necessarily own the console it's running in. It may have been started from cmd.exe or PowerShell or something else.

It needs to tread very carefully.

avih commented 1 year ago

It needs to tread very carefully.

Yeah. It does need care.

Ideally, the console input CP should only be set when busybox-w32 is taking interactive console input (shell prompt, vi, etc), and similarly, console output CP should only be set when busybox-w32 writes to the console (also non-interactively), and then both conversions are redundant.

Also, such a thing should make the euro feature completely redundant (assuming ACP does have the euro symbol, which I imagine it would? I can't test it myself...).

infoagee commented 1 year ago

Would you mind compiling this program, run it, and report what it prints?

ACP:936 OEMCP:936 Con(In)CP:936 ConOutCP:936

This output is for default codepage. Need a result for "attaching a UTF-8 manifest"? then how to "attach"?

avih commented 1 year ago

ACP:936 OEMCP:936 Con(In)CP:936 ConOutCP:936 This output is for default codepage.

Thanks. So indeed OEMCP is the same as console in/out CP. That's good news. Also it confirms that in this case the oem/char conversion (or console/ansi) is indeed redundant.

Need a result for "attaching a UTF-8 manifest"? then how to "attach"?

No, thanks.

But if you want to try it out, attaching a UTF8 manifest is described here (and if you don't have mt.exe, then you can use perc instead. /shameless plug). Of course, the manifest can also be attached at build time, but it can also be attached to some old program without rebuilding it.

It basically makes an "old style" (ANSI) program on Windows 10/11 behave as if the Windows API is UTF8 instead of whatever ACP the system has, effectively making it able to handle any arbitrary Unicode file names which are given as arguments, etc, without changing the program to use the W windows APIs.

So at the output I posted you can see that the program now sees ACP and OEM CP as 65001 (which is the code for UTF8).

I imagine that for you it will do the same - change ACP and OEMCP to 65001, and keep the console CP unmodified. Do keep in mind that it needs Windows 10 or later to work.

avih commented 1 year ago

Ideally, the console input CP should only be set...

An alternative, which is not unimaginable I think, is to always use ReadConsoleInputW and WriteConsoleW when the console is involved, and convert to/from ACP near the very edge (right after read, right before write), and then the console CPs don't have any impact anymore.

As a bonus, I think this will make "UTF8 manifest mode" work out of the box too for input/output (though editing still needs to be taught about screen lengths etc, maybe by flipping on some of the existing unicode options of busybox).

This will likely need some sort of input/output buffering, as it's no longer necessarily 1:1 lengths (between busybox/ansi bytes and wchar_t), but it should still be able to handle correctly all the cases which work today.
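
To illustrate why the lengths are no longer 1:1, here is a rough sketch of the input side under the assumption that the ACP is UTF-8 (manifest mode). For any other ACP, real code would more likely call `WideCharToMultiByte()`. `struct u16_state` and `u16_feed()` are hypothetical names. One UTF-16 code unit from the console can yield 0, 1, 2, 3 or 4 bytes, and a high surrogate has to be buffered until its partner arrives.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical converter state: holds a pending high surrogate, if any. */
struct u16_state { uint16_t high; };

/* Feed one UTF-16 code unit (as ReadConsoleInputW would deliver it) and
 * write 0 to 4 UTF-8 bytes to out. Returns the number of bytes written;
 * 0 means the unit was buffered and more input is needed. A pending high
 * surrogate followed by a non-surrogate unit is silently dropped. */
static size_t u16_feed(struct u16_state *st, uint16_t unit, unsigned char out[4])
{
    uint32_t cp;

    if (unit >= 0xD800 && unit <= 0xDBFF) {      /* high surrogate: buffer */
        st->high = unit;
        return 0;
    }
    if (unit >= 0xDC00 && unit <= 0xDFFF) {      /* low surrogate */
        if (!st->high)
            cp = 0xFFFD;                         /* unpaired: U+FFFD */
        else
            cp = 0x10000 + (((uint32_t)(st->high - 0xD800) << 10)
                            | (uint32_t)(unit - 0xDC00));
    } else {
        cp = unit;                               /* plain BMP code point */
    }
    st->high = 0;

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

Feeding the single unit 0x7684 (的) produces three bytes, while a character outside the BMP arrives as two units and produces nothing for the first and four bytes for the second, which is the buffering case mentioned above.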

infoagee commented 1 year ago

it needs Windows 10 or later to work.

I only have win 7, so I'll stop here.

avih commented 1 year ago

one could simply set the console in/out cp to ACP... and bingo...

It needs to tread very carefully.

Also, keep in mind that busybox-w32 already does something potentially just as harmful by default (the euro feature) for anyone with a console CP of 850: it changes the console CP to something which could theoretically break some ANSI programs, even though busybox doesn't always own the console.