Win32: support unicode editing

avih commented 1 year ago

The first commit adds support for building mingw with CONFIG_UNICODE_SUPPORT, but does not enable it yet.

For the most part it only changes some preprocessor conditions which previously considered ENABLE_PLATFORM_MINGW32 exclusively, and now consider also ENABLE_UNICODE_SUPPORT.

There's at least one thing which is not yet implemented - prefix-case-update with tab completion, e.g. with unicode enabled, typing ~/desk and then tab to complete results in ~/desktop/ rather than ~/Desktop/.

It's possible that there are more unnoticed issues, but it does seem to work nicely overall.

The second commit adds a script which can transform mingw64_defconfig into a unicode build config as mingw64u_defconfig which enables the UTF8 manifest, input, and editing.

It does not add the resulting config to the git repo, and I'm leaving this decision and additions for later.

Building with this config and running on Windows 10/11 results in a reasonably useful unicode build, while on Windows 7/8 it should behave as if the unicode support is disabled.

The third commit adds Emoji width and variations support to the internal wcwidth database, and could be considered more appropriate for upstream, but for now it's here.

Let me know if you want to drop the third commit from this PR.

Also, I'm not sure if the new preprocessor nesting/indentations are the preferred style, so let me know if you want it changed (or feel free to update the indentations yourself).

avih commented 1 year ago

Force-pushed an indentation fixup at the 1st commit (deleted one extra tab indent).

rmyorston commented 1 year ago

Looks good in my (minimal) testing.

The only problem I spotted is that the name of the script should probably be mk_mingw64u_defconfig rather than mk_mingw64u_defconig. Having execute permission would be nice too.

avih commented 1 year ago

should probably be mk_mingw64u_defconfig rather than mk_mingw64u_defconig

Oops. Thanks. Force-pushed a fix (also, I pushed from windows before, so the script ended up non-executable, now it's executable).

Looks good in my (minimal) testing.

Right. That's roughly my assessment too, but there gotta be other issues besides the prefix-case on tab completion, because the commit basically undo many of the windows-specific blocks when unicode is enabled.

My approach was to enable unicode and then add missing bits and disable windows-specific code till it compiled (I did read parts of the code, but I didn't study all of it thoroughly). I also figured this will make it obvious that nothing is broken without unicode, because the patch is so minimal and relatively easy to review.

However, it most probably can be improved over time (or maybe the windows-specific parts refactored or some such to also apply when using unicode).

But for now it's both safe and seem to be generally working, so that's not a bad start.

avih commented 1 year ago

My approach was ... However, it most probably can be improved ...

Please wait with the merge.

I want to go over the inverted block conditions again, maybe undo some of them, and for the others explain why they're disabled.

As it stands, of the 7 inverted mingw-blocks conditions when unicode is enabled:

Two disable inc_cursor when the terminal is without VT input - need to recheck why.
One restores slow-terminal optimization (up to 4 \b instead of one ESC"[%uD") - in retrospect the original looks OK?.
Three disable case-insensitive tab completion - IIRC it doesn't compile with unicode enabled otherwise.
One related to which byte values are considered "control" - in retrospect I think the original condition was OK

So please wait with the merge till I re-examine those.

avih commented 1 year ago

I pushed a fixup commit which I intend to squash into the first commit after you review it. It reduces the scope of the original change:

One restores slow-terminal optimization (up to 4 \b instead of one ESC"[%uD") - in retrospect the original looks OK?.

One related to which byte values are considered "control" - in retrospect I think the original condition was OK

These changes were indeed not required, and were reverted.

Two disable inc_cursor when the terminal is without VT input - need to recheck why.

The issue is that inc_cursor is broken with wide glyphs. But I still reverted those changes, in favor of a new runtime condition depending on whether unicode is active (instead of disabling at build time if unicode is supported).

I also raised some questions related to inc_cursor in a new comment.

Three disable case-insensitive tab completion - IIRC it doesn't compile with unicode enabled otherwise.

That's the only proper remaining change. As mentioned, I did notice at least one modified behavior with unicode (~/desk tab-completes to ~/desktop/ instead of ~/Desktop/), but also noticed that some case-sensitive things do work correctly - like completing function names.

However, I don't actually fully understand the impact of disabling these code blocks, and there could be real bugs lurking there, so I'd really appreciate if you could help me ensure that at least there are no nasty bugs from disabling these parts of the code.

rmyorston commented 1 year ago

inc_cursor() is intended to improve the appearance when moving right along the command line. With put_cur_glyph_and_inc_cursor() the console in Windows 8 and earlier takes a while to display the cursor after a movement. In this case displaying the glyph is unnecessary. Not doing so results in the cursor appearing more quickly. In other cases where put_cur_glyph_and_inc_cursor() is used it is necessary to display the glyph.

The console and terminal in Windows 10 and 11 don't suffer from the same problem and using put_cur_glyph_and_inc_cursor() there is fine.

I don't now see any reason for put_cur_glyph_and_inc_cursor() to be used if VT_INPUT is enabled. In my tests removing the VT_INPUT condition doesn't seem to make any difference, so I recommend changing the code to:

static void input_forward(void)
{
    if (cursor < command_len)
#if !ENABLE_PLATFORM_MINGW32
        put_cur_glyph_and_inc_cursor();
#else 
    {
        if (unicode_status == UNICODE_ON)
            put_cur_glyph_and_inc_cursor();
        else
            inc_cursor();
    }
#endif
}

In a non-Unicode build inc_cursor() will be used unconditionally. This will give the intended appearance with all versions of Windows.

In a Unicode build put_cur_glyph_and_inc_cursor() will be used when CP 65001 is being used, as required for wide glyphs. On Windows 8 and below it's unwise to set CP 65001, so inc_cursor() will be used and the cursor will appear promptly. That should keep everyone happy.

avih commented 1 year ago

In this case displaying the glyph is unnecessary. Not doing so results in the cursor appearing more quickly. In other cases where put_cur_glyph_and_inc_cursor() is used it is necessary to display the glyph.

Right. Thanks. I don't think I saw this explanation previously (in a comment or commit message), though IIRC I did see a commit message about a similar issue with the slow-terminal optimization code (the multiple \b).

I don't now see any reason for put_cur_glyph_and_inc_cursor() to be used if VT_INPUT is enabled. In my tests removing the VT_INPUT condition doesn't seem to make any difference

Same. It seems to work (on win10) even with BB_TERMINAL_MODE=0 - where VT_INPUT should be disabled.

I recommend changing the code to...

Yes, I also think that's better.

But I think this fixes an issue with 8ade494aebe60ea14026d48025a462e6d0b58a7f (to ignore VT_INPUT), so I don't think it belongs at the unicode editing PR.

I don't mind if you make it use inc_cursor unconditionally again and then I'll rebase the PR, or you can also fix that after the PR is merged. I don't think there's an urgency here.

Let me know what you prefer.

Other than this, shall I squash the fixup commit into the first commit? (and still keep the 2nd and 3rd commits separate).

rmyorston commented 1 year ago

I'll fix the VT_INPUT thing after this has been merged.

shall I squash the fixup commit into the first commit?

Hold on for now. I haven't looked at the case-insensitive tab completion yet.

avih commented 1 year ago

put_cur_glyph_and_inc_cursor() will be used when CP 65001 is being used, as required for wide glyphs

Actually, with DBCS codepages like 936, wouldn't that also end up with wide glyphs? (no idea whether inc_cursor is similarly incorrect in this case, but I'd guess yes).

That being said, if the cursor forwards badly on win 7/8 without inc_cursor then I think we're stuck with that even if it's not incremented correctly...

Hold on for now. I haven't looked at the case-insensitive tab completion yet

Well, that part is not touched at the fixup commit, but holding on for now anyway.

avih commented 1 year ago

So, for some reason I thought the remaining 3 mingw code-blocks conditions were needed for compilation, but apparently they're not, so the 2nd fixup commit I just pushed reverts them.

After the 2 fixups, no flipped preprocessors conditions remain, and it compiles and seems working.

However, the code which replaces the prefix with a case-correct version (e.g. so that ~/desk tab-complete to ~/Desktop/) only exists at the non-unicode path, and so this still doesn't work.

It's now less likely to have lurking bugs, but it's still possible if some of the relevant mingw code which handles case sensitivity is only compiled without unicode support - just like the case-correct prefix update code (though the case-correct-update is unlikely to have lurking bugs, othre than the obvious that it doesn't work).

But with the 2nd fixup it's even less intrusive than before, so that's better.

Another thing which I noticed with cmd.exe console window and Windows terminal 1.18 (but not WT 1.17) is that typing can be a bit wonkey, especially when typing non-ASCII chars quickly-ish.

It seems like that when typing with one finger (i.e. that each key is depressed before the next one is pressed) works OK, but when typing fast (e.g. F-down O-down F-up O-up, assuming the KB is in non-en mode and F and O produce non-ASCII chars), then it produces some incorrect repeatitions of the keys.

I'm going to assume that this is a ReadConsoleInput_utf8 issue, possibly related to the missing-key-down workaround - which is indeed applied only for codepoints > 127, so I'll look into that too.

It's weird that I never noticed it before though, not sure why...

@rmyorston can you reproduce the quick unicode typing issue in busybox-w32 with unicode?

avih commented 1 year ago

I'm going to assume that this is a ReadConsoleInput_utf8 issue, possibly related to the missing-key-down workaround

It was indeed this.

Fixed by force-pushing a commit "win32: UTF8 input: improve missing-key-down hack" before the others (the other commits are unmodified, and the two fixup commits stil await squashing into "win32: support interactive unicode-aware editing").

avih commented 1 year ago

so I recommend changing the code to:

I updated the comment regarding inc_corsor at the first fixup, and added a commit which changes the condition as suggested.

Since the unicode-editing commit currently (after the fixups) doesn't modify any existing mingw preprocessor conditions, the remaining question is whether there are lurking bugs introduced by enabling CONFIG_UNICODE_SUPPORT which result from some existing mingw code which was only added to to code-paths without unicode support.

We already know about the missing tab-completion-prefix-case-update issue, but there could be others, possibly more severe.

avih commented 1 year ago

Fixed by force-pushing a commit "win32: UTF8 input: improve missing-key-down hack"

And this introduced a regression where it incorrectly adds a false ENTER on launch (the key-up of ENTER after launching busybox doesn't have a matching ENTER-down, so it changes this up to down).

This is caused because previously the hack was applied to codepoints above 127 (excluding ENTER), and now it's for all non-0 codepoints.

Need to think how to handle it nicely.

avih commented 1 year ago

This is caused because previously the hack was applied to codepoints above 127 (excluding ENTER), and now it's for all non-0 codepoints.

Pushed a fixup which changes it back to codepoint > 127. I'll squash it into the improved hack commit after review,

avih commented 1 year ago

For convenience of review, I've pushed the squashed commits to https://github.com/avih/busybox-w32/commits/win32-unicode-edit-combined

The source files are identical to this PR, and the unicode-editing commit message was updated to reflect the squashed fixups.

avih commented 1 year ago

Since the unicode-editing commit currently (after the fixups) doesn't modify any existing mingw preprocessor conditions, the remaining question is whether there are lurking bugs introduced by enabling CONFIG_UNICODE_SUPPORT which result from some existing mingw code which was only added to to code-paths without unicode support.

We already know about the missing tab-completion-prefix-case-update issue, but there could be others, possibly more severe.

So, I went over all the instances of UNICODE_SUPPORT at the source code, and I think there are only two cases of MINGW code which is disabled when CONFIG_UNICODE_SUPPORT is set:

The prefix-case-update on tab completion - which we know about.
Some char > 127 which prints as ? without MINGW and normally with MINGW (which is not a problem, because the unicode path prints all chars equally).

So I think the completion-prefix-case-update might actually be the only issue when building with CONFIG_UNICODE_SUPPORT, which, even if not fixed later, is not critical I think (because the FS is case-insensitive, then ~/desktop/ refers to the correct dir even without the prefix-case-update).

So overall, we should be pretty good I think.

rmyorston commented 1 year ago

The squashed-commit version seems to be in reasonable shape now. A couple of minor things before merging, if you please:

I'd prefer the changes in the 'unicode: identify emoji width and modifiers' commit to be protected by ENABLE_PLATFORM_MINGW32. Building a Linux version of BusyBox from the busybox-w32 source should result in the same binary as using the upstream source.
A few places in the code and commit messages need s/inc_corsor/inc_cursor/g

avih commented 1 year ago

Building a Linux version of BusyBox from the busybox-w32 source should result in the same binary as using the upstream source.

Right, I thought because it's not mingw-specific it should be outside of the guard, but I see your point. done.

s/inc_corsor/inc_cursor/g

... done.

avih commented 1 year ago

Hmm.. interesting, I reset this PR branch to the squashed branch win32-unicode-edit-combined, and it spread the commits between the messages here according to their dates... (I thought they would be bunched at the bottom...)

avih commented 1 year ago

Building a Linux version of BusyBox from the busybox-w32 source should result in the same binary as using the upstream source.

Right, I thought because it's not mingw-specific it should be outside of the guard, but I see your point. done.

Do you prefer to just drop this commit? I wouldn't mind.

rmyorston commented 1 year ago

Do you prefer to just drop this commit?

Since we have it we might as well include it. Wouldn't want people to feel they were missing out on emojis.

OK, I'm about to press the big green button...

avih commented 1 year ago

Wouldn't want people to feel they were missing out on emojis.

That was my motivation as well... :)

But then I realized I don't actually know where it manifests (well, except in fold and some such), so not sure how much it adds in practice.

But it doesn't hurt either.

Thanks.

avih commented 1 year ago

Do we want to add some unicode blurb at the readme and/or provide such builds?

Also, I wanted to see what it would take to have a more up to date wcwidth, and looks like this termux implementation is up to date with latest Unicode (15) https://github.com/termux/wcwidth/blob/master/wcwidth.c .

This includes the emoticons block which I added in my commit (U+1F600 - U+1F64F), and also some more blocks which are still missing at the busybox wcwidth.

There are other implementations too, most commonly referred to is https://www.cl.cam.ac.uk/~mgk25/ucs/wcwidth.c from 2007 with Unicode 5 data, which doesn't include emojis (and seems similar to the busybox implementation scope).

EDIT - it's similar in scope because it seems to be based on this exact version from 2007 - https://github.com/rmyorston/busybox-w32/blob/master/libbb/unicode.c#L390-L396

Of course, this needs to be in sync with the terminal which renders those, but there is no system windows wcwidth which the windows terminal uses and which utils (like busybox-w32) can also use. It was asked at the windows terminal repo, and there's no good reply that I can tell (other than using a popular unicode library) - https://github.com/microsoft/terminal/issues/218

So I guess sticking to data which is derived directly from the latest Unicode tables is the second best... and in general more likely to match the Windows terminal database than other methods IMO.

rmyorston commented 1 year ago

Do we want to add some unicode blurb at the readme and/or provide such builds?

That would be the eventual aim. In the short term I'll update my prerelease build/deploy scripts to include a Unicode binary. That can then be mentioned in issue 17 to widen the scope of testing.

Beyond that, is this work now in a suitable state for inclusion in a full release?

There isn't any fixed schedule for releases. It's nearly two months since the last one so it's about time to start considering a new one. But if a delay of a month would result in a material improvement that would be fine too.

As for wcwidth, I defer to your preference. If an update is worthwhile it would probably be best to submit it upstream.

avih commented 1 year ago

Beyond that, is this work now in a suitable state for inclusion in a full release?

I don't know what the criteria is, and hopefully it is suitable, but I haven't dog-fooded the unicode version which is in master (I have been using the UTF8-manifest version for more than a year, but not the recent utf8 input and UNICODE_SUPPORT build opts).

I will start using the unicode build now though (with the config created by mk_mingw64u... script), as I was waiting for it to get merged first.

I do think it would be sutable for inclusion if it's described as experimental, but I wouldn't know where to put that note at (I don't think too many users read the release notes TBH... I know I typically don't, eventhough I do find it interesting when I sometimes do read it).

But if a delay of a month would result in a material improvement that would be fine too.

Well, if you and I will use it regularly from now on (which I intend to do), then hopefully any major existing issues would pop up, and if we don't fix/file bugs/PRs then hopefully it's in decent state.

FWIW, there's not a lot of UNICODE_SUPPORT stuff at the busybox utils other than line edit, and I think the most visible part would be the UTF8 input (like the incorrect repeated keys thing which I fixed), and any input issues should raise their head fairly quickly I think...

As for wcwidth, I defer to your preference. If an update is worthwhile it would probably be best to submit it upstream.

Upstream would definitely be more suitable, but it uses hand "optimized" code to make it smaller, which is a PITA to update to a newer version, so I'm not going to send a patch (I might report that it's out of date though).

How worthwhile it is depends on your point of view. It's a cross between 15 years of advances in Unicode (unicode 5 to 15) and how much wcwidth is used in busybox.

So far the only place I know for sure it's used is line-wrap during line-edit (e.g. shell prompt). As far as I can tell it's not used in fold.

Test case for the shell prompt:

Paste 😀 (U+1F600) repeatedly until the line wraps, then move backwards with the left arrow, and confirm the cursor goes back to the end of the 1st line (and then can continue backward to the begining of the line).
Try the same with 🚀 (U+1F680).

The result of 2 is that you can't go back to the 1st line, because U+1F680 is not at the unicode 5 double-width table and so it doesn't know that the line was wrapped, and doesn't move the cursor to prev line (U+1F600 is also not in unicode 5, but it was added to the database in our emoji commit).

For now, I added commit "unicode: use newer wcwidth with BB_NEW_WCWIDTH=1" here https://github.com/avih/busybox-w32/commits/avih so that I can experiment with it.

ale5000-git commented 1 year ago

Is there any 32-bit build with all UTF8 settings enabled except FEATURE_UTF8_MANIFEST? In this way it would support UTF8 on new versions of Windows (with chcp) but still works perfectly on old versions.

avih commented 1 year ago

Is there any 32-bit build with all UTF8 settings enabled except FEATURE_UTF8_MANIFEST? In this way it would support UTF8 on new versions of Windows (with chcp) but still works perfectly on old versions.

It should run perfectly on win7/8 with the UTF8 manifest, just like the normal builds - without unicode support.

It can't currently have unicode features on win 7/8 regardless of the UTF8 manifest.

It can't run at all on Windows XP with the UTF8 manifest.

All the busybox unicode build options only work on UTF-8 strings, and currently the strings are UTF8 only on windows 10+ with the UTF8 manifest, which is why these unicode features are disabled when the strings are not UTF8 (win 7/8).

If it's built without the UTF8 manifest then it won't have UTF-8 on windows 10+ and all the busybox unicode options don't do anything good, so no, you can't build without the UTF8 manifest and still have unicode support.

What is the goal here? to have a single binary which will have unicode on windows 10 but which will still run on Windows XP? That's not possible currently, because the very thing which make it unicode on windows 10 (the UTF8 manifest) is the same thing which makes it not run on XP.

If not, why else would you want this combination of features compared to a separate builds?

ale5000-git commented 1 year ago

Yes, I would like to have, if possible, the same portable executable that can run as UTF-8 on Windows 10 and as NOT UTF-8 on Windows XP.

How I already said in the other post currently I don't have time, but the cause should be investigated. Or do you know a site that explain the exact cause? (I mean something like: parsing errors or values explicitly disallowed)

avih commented 1 year ago

Or do you know a site that explain the exact cause?

I don't. I only know that if the UTF8 manifest is attached, then win7/8 ignore it (because they don't support UTF8 ACP), and Windows XP refuses to run the binary, I'm guessing maybe because XP doesn't like the XML schema at the manifest.

Feel free to research as much as you like. Once you have conclusions, please do share them.

avih commented 1 year ago

Of course, this needs to be in sync with the terminal which renders those, but there is no system windows wcwidth which the windows terminal uses and which utils (like busybox-w32) can also use. It was asked at the windows terminal repo, and there's no good reply that I can tell (other than using a popular unicode library) - microsoft/terminal#218

For now, I added commit "unicode: use newer wcwidth with BB_NEW_WCWIDTH=1" here https://github.com/avih/busybox-w32/commits/avih so that I can experiment with it.

For what it's worth, it seems that the windows terminal embeds the unicode 15 data , similar to what busybox does (unicode 5 data), and it should be the same data as in my BB_NEW_WCWIDTH commit (unicode 15 data).

It doesn't look like it has combining chars table though, which is in line with my observations that the windows terminal doesn't currently render combining chars correctly.

rmyorston / busybox-w32

Win32: support unicode editing #340