rmyorston / busybox-w32

WIN32 native port of BusyBox.
https://frippery.org/busybox
Other
670 stars 124 forks source link

win32: unicode: use newer wcwidth by default #390

Closed avih closed 4 months ago

avih commented 5 months ago

This commit adds a new wcwidth implementation at libbb/wcwidth_alt.c, and uses it instead of the existing implementation when compiling for windows and CONFIG_LAST_SUPPORTED_WCHAR >= 0x30000.

The windows-target condition keeps non-windows build unmodified, and the last supported wchar threshold is a semi-hack to allow switching between implementations without adding a new config option (the old code supports codepoints up to 0x2ffff).

The new file wcwidth_alt.c was generated by a new scripts/mkwcwidth.sh which prints a wcwidth implementation using latest unicode data from a local clone of https://github.com/jquast/wcwidth . This repo is the main python wcwidth implementation, and is maintained and up to date.

Functional differences from the existing implementation:

Technical differences:

Overall, this adds about 2.5K to the binary when enabled, with the new data adding about 3K, and the new code saving about 0.5K, and in the context of Windows Unicode binary, likely matters little.

avih commented 5 months ago

I wasn't sure if the new implementation should be inline like the existing implementation. I thought a new file is cleaner.

It's also possible to make the script only generate the data tables, while keeping the new code (wcwidth and intable) as part of the existing code.

But because the code is tied to the data (tables) format, I thought though it's better that the script generates both the data and the code which uses it.

Let me know if you have different preferences. I don't mind too much.

avih commented 5 months ago

I have an approach to automate compressed (16 bit) ranges, which roughly halves the data size, and is actually also faster.

I'll post it a bit later.

avih commented 5 months ago

Just added a commit which halves the data size, and is also considerably faster, but still reasonably simple.

See the commit message for details.

The two commits should eventually be squashed, but for now I kept them separate to make it easier to observe the changes.

I think it's useful to have a mroe modern wcwidth implementation, and I think this one is cute and tight.

Thoughts?

rmyorston commented 5 months ago

Look OK to me (though I'm no Unicode expert).

avih commented 5 months ago

Thanks.

  • There should probably be copyright/licence information in the source file as well as the script.

The 2nd commit did add copyright to the script, but I'll change it to the form found e.g. in mkconfigs (though many scripts don't have copyright). I'll keep the license MIT - same as at https://github.com/jquast/wcwidth/blob/master/LICENSE .

What kind of copyright should the generated file have? same as the script?

  • The function definition in libbb/wcwidth_alt.c should be int FAST_FUNC wcwidth(uint32_t ucs), or the 32-bit build fails. (FAST_FUNC is supposed to optimise function calls. It's only used in 32-bit builds.)

Right. I'll do that.

rmyorston commented 5 months ago

What kind of copyright should the generated file have? same as the script?

That's up to you. It has to be compatible with GPLv2 but other than that I don't mind.

avih commented 5 months ago

Force-pushed:

Please do go over the changes to ensure I didn't botch something.

rmyorston commented 4 months ago

Merged, thanks.

There are new prereleases, but only the Unicode binary will be affected.

avih commented 4 months ago

Thanks.