rmyorston / busybox-w32

WIN32 native port of BusyBox.
https://frippery.org/busybox

win32: UTF8_OUTPUT: recover quicker from bad byte #383

Closed avih closed 6 months ago

avih commented 6 months ago

When an unexpected value is detected in UTF-8, we should print the placeholder codepoint, and then recover whenever we detect a value which is valid for starting a new UTF-8 codepoint (including ASCII7).

However, previously, we only tested recovery at the byte following the unexpected one, and so if the first unexpected value was also valid for a new codepoint, then we ignored it (and the rest of the codepoint if it wasn't ASCII7).

Now we check for recovery starting at the first unexpected byte itself. If that byte is recoverable, we need both to print the placeholder and to recover, so the recovery "unwinding" is modified a bit to allow a placeholder.

Example of a sequence which now recovers quicker than before:

printf "\xF0\xF0\x9F\x98\x80A"

Where UTF-8 for U+1F600 "😀" is: 0xF0 0x9F 0x98 0x80
Previously: ?A
Now: ?😀A
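
To illustrate the behaviour described above, here is a minimal, standalone sketch of a decoder that emits a placeholder for an unexpected byte and then tests that same byte as a possible start of a new codepoint. It is not the busybox-w32 code: the names `lead_info`, `decode_utf8` and `PLACEHOLDER` are hypothetical, U+FFFD is assumed as the placeholder codepoint, a run of consecutive bad bytes is assumed to produce a single placeholder, and overlong forms, surrogates and values above U+10FFFF are deliberately not validated.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PLACEHOLDER 0xFFFDu  /* assumed placeholder codepoint (U+FFFD) */

/* Return the number of continuation bytes expected after lead byte b and
 * store its payload bits in *bits, or return -1 if b cannot start a
 * sequence (hypothetical helper, simplified validation). */
static int lead_info(unsigned char b, uint32_t *bits)
{
    if (b < 0x80)           { *bits = b;        return 0; }  /* ASCII7 */
    if ((b & 0xE0) == 0xC0) { *bits = b & 0x1F; return 1; }
    if ((b & 0xF0) == 0xE0) { *bits = b & 0x0F; return 2; }
    if ((b & 0xF8) == 0xF0) { *bits = b & 0x07; return 3; }
    return -1;  /* continuation byte or invalid lead */
}

/* Decode n bytes from in[] into codepoints in out[] (room for n entries
 * is sufficient). Returns the number of codepoints written. */
size_t decode_utf8(const unsigned char *in, size_t n, uint32_t *out)
{
    size_t count = 0;
    int need = 0;      /* continuation bytes still expected */
    int skipping = 0;  /* placeholder already emitted, waiting for a lead */
    uint32_t cp = 0;

    assert(in != NULL && out != NULL);

    for (size_t i = 0; i < n; i++) {
        unsigned char b = in[i];

        if (need > 0) {
            if ((b & 0xC0) == 0x80) {
                cp = (cp << 6) | (b & 0x3F);
                if (--need == 0)
                    out[count++] = cp;
                continue;
            }
            /* Unexpected byte in mid-sequence: emit the placeholder, then
             * fall through so this same byte is tested for recovery. */
            out[count++] = PLACEHOLDER;
            need = 0;
            skipping = 1;
        }

        int k = lead_info(b, &cp);
        if (k < 0) {
            /* Not a valid start: one placeholder per bad run. */
            if (!skipping) {
                out[count++] = PLACEHOLDER;
                skipping = 1;
            }
            continue;
        }

        /* Recovered: b starts a new codepoint. */
        skipping = 0;
        if (k == 0)
            out[count++] = cp;
        else
            need = k;
    }

    if (need > 0)
        out[count++] = PLACEHOLDER;  /* input ended in mid-sequence */
    return count;
}
```

Fed the six bytes from the `printf` example, this sketch yields a placeholder, then U+1F600, then "A". The first 0xF0 opens a four-byte sequence, the second 0xF0 is unexpected there, so the placeholder is emitted and the second 0xF0 is immediately accepted as the start of the emoji.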

avih commented 6 months ago

Force pushed. Fixed a typo in the source code (palceholder -> placeholder) and refined the commit message very slightly.

rmyorston commented 6 months ago

Works as advertised, thanks. Prerelease binaries are available.

avih commented 6 months ago

Thanks.