win32: unicode: use newer wcwidth by default

avih commented 5 months ago

This commit adds a new wcwidth implementation at libbb/wcwidth_alt.c, and uses it instead of the existing implementation when compiling for windows and CONFIG_LAST_SUPPORTED_WCHAR >= 0x30000.

The windows-target condition keeps non-windows builds unmodified, and the last-supported-wchar threshold is a semi-hack to allow switching between implementations without adding a new config option (the old code supports codepoints up to 0x2ffff).
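For illustration only, the selection condition could be sketched roughly as below. `PLATFORM_IS_WINDOWS` and `USE_WCWIDTH_ALT` are hypothetical names invented for this sketch, and the fallback value of `CONFIG_LAST_SUPPORTED_WCHAR` is just an example; the actual commit may test different macros.

```c
/* Sketch only: PLATFORM_IS_WINDOWS and USE_WCWIDTH_ALT are hypothetical names. */
#ifdef _WIN32
# define PLATFORM_IS_WINDOWS 1
#else
# define PLATFORM_IS_WINDOWS 0
#endif

/* Normally provided by the build configuration; example value for this sketch. */
#ifndef CONFIG_LAST_SUPPORTED_WCHAR
# define CONFIG_LAST_SUPPORTED_WCHAR 0x30000
#endif

/* The old code supports codepoints up to 0x2ffff, so a threshold of
 * 0x30000 or more is used as the switch to the new implementation. */
#if PLATFORM_IS_WINDOWS && CONFIG_LAST_SUPPORTED_WCHAR >= 0x30000
# define USE_WCWIDTH_ALT 1
#else
# define USE_WCWIDTH_ALT 0
#endif
```

With this shape, a non-windows build always gets `USE_WCWIDTH_ALT` set to 0, whatever the configured threshold is.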

The new file wcwidth_alt.c was generated by a new scripts/mkwcwidth.sh which prints a wcwidth implementation using latest unicode data from a local clone of https://github.com/jquast/wcwidth . This repo is the main python wcwidth implementation, and is maintained and up to date.

Functional differences from the existing implementation:

Unicode 15.1.0 (latest) in the new version (about 450 ranges of wide and zero-width codepoints), compared to roughly Unicode 5.0 in the existing code (a nearly 20-year-old spec, about 150 ranges). The new spec includes, among others, various wide icons and emojis, which can now be edited correctly at the shell prompt, have the correct width in 'ls', etc.
The old implementation returns -1 (non-printable) for surrogates, while the new code returns 1, though this is inconsequential, and POSIX doesn't care. Also libc implementations vary in this regard.

Technical differences:

The new code is very small and straightforward, which makes it easy to update the tables to a newer spec as needed. The old version mixes code, data, and preprocessor checks, which makes updates hard to automate.
The old version compiles less code/data when the last supported wchar is smaller, while the new version doesn't. This doesn't matter because the new version is enabled only for the full range.
The old version compresses the data by using 16-bit ranges (and more code), while the new version uses 32-bit ranges (Unicode is 21 bits). This is one of the reasons the new version is bigger (about 3.5K for the new data). The other reason is that the new version has about 3x as many data ranges as the old version.

Overall, this adds about 2.5K to the binary when enabled: the new data adds about 3K and the new code saves about 0.5K. In the context of the Windows Unicode binary, this likely matters little.
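As a rough illustration of the range-table approach described above, here is a minimal sketch, not the generated libbb/wcwidth_alt.c. The function name `wcwidth_sketch`, the struct layout, and the tiny hand-picked tables are assumptions made for this example; the real tables hold about 450 ranges produced by the script.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* One closed interval [first, last] of codepoints, stored as 32-bit values. */
struct range { uint32_t first, last; };

/* Hand-picked excerpts for illustration; sorted and non-overlapping. */
static const struct range zero_width[] = {
    { 0x0300, 0x036F },   /* combining diacritical marks */
    { 0x200B, 0x200F },   /* zero width space .. right-to-left mark */
};
static const struct range wide[] = {
    { 0x1100, 0x115F },   /* Hangul Jamo initial consonants */
    { 0x4E00, 0x9FFF },   /* CJK unified ideographs */
    { 0x1F600, 0x1F64F }, /* emoticons */
};

#define ARRAY_LEN(a) (sizeof(a) / sizeof((a)[0]))

/* Binary search: returns 1 if ucs falls inside any range of the table. */
static int intable(const struct range *t, size_t n, uint32_t ucs)
{
    size_t lo = 0, hi = n;

    assert(n == 0 || t != NULL);
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (ucs > t[mid].last)
            lo = mid + 1;
        else if (ucs < t[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Hypothetical stand-in for the generated wcwidth(). */
int wcwidth_sketch(uint32_t ucs)
{
    if (ucs == 0)
        return 0;
    if (ucs < 0x20 || (ucs >= 0x7F && ucs < 0xA0))
        return -1;  /* C0/C1 control characters: non-printable */
    if (intable(zero_width, ARRAY_LEN(zero_width), ucs))
        return 0;
    if (intable(wide, ARRAY_LEN(wide), ucs))
        return 2;
    return 1;       /* everything else, including surrogates */
}
```

Because the code only walks sorted tables, moving to a newer Unicode version means regenerating the two arrays and nothing else.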

avih commented 5 months ago

I wasn't sure if the new implementation should be inline like the existing implementation. I thought a new file is cleaner.

It's also possible to make the script only generate the data tables, while keeping the new code (wcwidth and intable) as part of the existing code.

But because the code is tied to the data (tables) format, I thought it's better that the script generates both the data and the code which uses it.

Let me know if you have different preferences. I don't mind too much.

avih commented 5 months ago

I have an approach to automate compressed (16 bit) ranges, which roughly halves the data size, and is actually also faster.

I'll post it a bit later.

avih commented 5 months ago

Just added a commit which halves the data size, and is also considerably faster, but still reasonably simple.

See the commit message for details.
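The commit message is not reproduced in this thread, so the following is only one plausible way to store ranges in 16 bits, not necessarily what the commit does. It assumes per-plane tables that keep only the low 16 bits of each codepoint; all names and table contents are hypothetical. The size arithmetic is consistent with the figures above: about 450 ranges at 8 bytes each is about 3600 bytes, versus about 1800 bytes at 4 bytes each.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

struct range32 { uint32_t first, last; };  /* 8 bytes per range */
struct range16 { uint16_t first, last; };  /* 4 bytes per range */

/* Hypothetical per-plane tables holding only the low 16 bits (excerpts). */
static const struct range16 wide_plane0[] = {
    { 0x1100, 0x115F },  /* U+1100..U+115F */
    { 0x4E00, 0x9FFF },  /* U+4E00..U+9FFF */
};
static const struct range16 wide_plane1[] = {
    { 0xF600, 0xF64F },  /* U+1F600..U+1F64F */
};

/* Binary search over a sorted table of 16-bit ranges. */
static int in_table16(const struct range16 *t, size_t n, uint16_t v)
{
    size_t lo = 0, hi = n;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (v > t[mid].last)
            lo = mid + 1;
        else if (v < t[mid].first)
            hi = mid;
        else
            return 1;
    }
    return 0;
}

/* Picks the table by plane (upper bits), then searches the low 16 bits. */
int is_wide_sketch(uint32_t ucs)
{
    uint16_t low = (uint16_t)(ucs & 0xFFFF);

    assert(ucs <= 0x10FFFF);
    switch (ucs >> 16) {
    case 0:
        return in_table16(wide_plane0, sizeof(wide_plane0) / sizeof(wide_plane0[0]), low);
    case 1:
        return in_table16(wide_plane1, sizeof(wide_plane1) / sizeof(wide_plane1[0]), low);
    default:
        return 0;  /* other planes are omitted in this sketch */
    }
}
```

Splitting by plane also shortens each search, which is one possible reason a 16-bit layout could end up faster as well as smaller.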

The two commits should eventually be squashed, but for now I kept them separate to make it easier to observe the changes.

I think it's useful to have a more modern wcwidth implementation, and I think this one is cute and tight.

Thoughts?

rmyorston commented 5 months ago

Looks OK to me (though I'm no Unicode expert).

I'm happy with the separate source file and generating both code and data.
There should probably be copyright/licence information in the source file as well as the script.
The function definition in libbb/wcwidth_alt.c should be int FAST_FUNC wcwidth(uint32_t ucs), or the 32-bit build fails. (FAST_FUNC is supposed to optimise function calls. It's only used in 32-bit builds.)
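For context, a sketch of why the qualifier matters. The `FAST_FUNC` definition below approximates the one in BusyBox's include/platform.h and is written from memory, so treat it as an assumption; `wcwidth_demo` is a hypothetical stand-in for the real function. On 32-bit x86 the macro changes the calling convention, so a prototype declared with it and a definition without it no longer match.

```c
#include <assert.h>
#include <stdint.h>

/* Approximation of BusyBox's FAST_FUNC (assumption, not copied from the tree). */
#if defined(__GNUC__) && defined(__i386__)
# define FAST_FUNC __attribute__((regparm(3), stdcall))
#else
# define FAST_FUNC
#endif

/* Prototype, as a shared header would declare it. */
int FAST_FUNC wcwidth_demo(uint32_t ucs);

/* The definition has to repeat FAST_FUNC so it agrees with the prototype. */
int FAST_FUNC wcwidth_demo(uint32_t ucs)
{
    assert(ucs <= 0x10FFFF);  /* valid Unicode codepoint range */
    return 1;                 /* placeholder body for this sketch */
}
```

On targets where the macro expands to nothing the mismatch goes unnoticed, which would explain why only the 32-bit build fails.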

avih commented 5 months ago

Thanks.

> There should probably be copyright/licence information in the source file as well as the script.

The 2nd commit did add copyright to the script, but I'll change it to the form found e.g. in mkconfigs (though many scripts don't have copyright). I'll keep the license MIT - same as at https://github.com/jquast/wcwidth/blob/master/LICENSE .

What kind of copyright should the generated file have? same as the script?

> The function definition in libbb/wcwidth_alt.c should be int FAST_FUNC wcwidth(uint32_t ucs), or the 32-bit build fails. (FAST_FUNC is supposed to optimise function calls. It's only used in 32-bit builds.)

Right. I'll do that.

rmyorston commented 5 months ago

> What kind of copyright should the generated file have? same as the script?

That's up to you. It has to be compatible with GPLv2 but other than that I don't mind.

avih commented 5 months ago

Force-pushed:

Rebased to latest master.
Changed scripts/mkwcwidth.sh to scripts/mkwcwidth - similar to other scripts.
Squashed the two commits.
Added full copyright block to the script, short copyright block to the C file, both MIT.
Added FAST_FUNC.
Refined the commit message.
Added a commit to revert commit 878b3cd2, since it's covered by the new wcwidth and reverting it would simplify merging future upstream wcwidth changes.

Please do go over the changes to ensure I didn't botch something.

rmyorston commented 4 months ago

Merged, thanks.

There are new prereleases, but only the Unicode binary will be affected.

avih commented 4 months ago

Thanks.

rmyorston / busybox-w32

win32: unicode: use newer wcwidth by default #390