Open vinc17fr opened 2 weeks ago
Cool bug.
I'd guess the issue is that when a utf-8 sequence is decoded, an unexpected byte value is considered part of the invalid codepoint encoding, without taking into account that it might still be a valid 1st/only byte of a new codepoint encoding.
I fixed the same issue not long ago in the busybox-w32 UTF-8 decoder.
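To make the guess above concrete, here is a minimal sketch in C of a decoder step that does not swallow the unexpected byte. It is not the code from less or busybox-w32; `decode_one` is a hypothetical helper written for illustration, and it deliberately omits the overlong and surrogate range checks that a complete decoder needs.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from less or busybox): decode one code point
 * from s[0..n), n > 0. Stores the code point, or U+FFFD for an invalid
 * sequence, in *cp and returns the number of bytes consumed (at least 1).
 * Overlong and surrogate checks are omitted to keep the sketch short. */
static size_t decode_one(const unsigned char *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL && n > 0);

    unsigned char b = s[0];
    size_t need;                          /* continuation bytes expected */

    if (b < 0x80) {
        *cp = b;                          /* plain ASCII byte */
        return 1;
    } else if (b >= 0xC2 && b <= 0xDF) {
        *cp = b & 0x1F; need = 1;
    } else if (b >= 0xE0 && b <= 0xEF) {
        *cp = b & 0x0F; need = 2;
    } else if (b >= 0xF0 && b <= 0xF4) {
        *cp = b & 0x07; need = 3;
    } else {
        *cp = 0xFFFD;                     /* stray continuation or invalid lead */
        return 1;
    }

    for (size_t i = 1; i <= need; i++) {
        if (i >= n || (s[i] & 0xC0) != 0x80) {
            /* The sequence breaks at s[i]. Consume only the i bytes seen
             * so far, so that s[i] is examined again by the next call as
             * a possible first (or only) byte of a new code point. */
            *cp = 0xFFFD;
            return i;
        }
        *cp = (*cp << 6) | (s[i] & 0x3F);
    }
    return need + 1;
}
```

Called repeatedly on the bytes 61 E9 62 63 64 0A, this reports "a", then one invalid sequence of length 1 for E9, then "b" as an ordinary character. A decoder that returned `i + 1` at the break would fold "b" into the invalid sequence instead.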
printf "a\xe9bcd\n"
Not a great example.
POSIX doesn't define \x in a shell string, and in a C string such a hex sequence consists of all the hex digits it can collect (even if it should end up as a single byte value, i.e. in a normal non-wide string; in this example, everything up to the \n), so it's not obvious that the shell would interpret it as \xe9 followed by the rest literally. It certainly wouldn't be interpreted as such in C.
A better example would be to use an octal literal, which is specified in POSIX:
printf "a\351bcd\n"
(my commit message above has the same issue, though it's specific to busybox, and busybox-ash does take at most two hex digits after \x)
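For the C side of that argument, a small sketch may help. The names `octal_form`, `hex_form` and `check_forms` are made up for illustration. It relies on two rules of the C language: an octal escape takes at most three digits, and adjacent string literals are concatenated only after escape sequences have been converted, so splitting the literal is one way to end a hex escape early.

```c
#include <assert.h>
#include <string.h>

/* An octal escape takes at most three digits, so \351 ends before 'b'. */
static const char octal_form[] = "a\351bcd\n";

/* A hex escape collects every hex digit that follows, so "\xe9bcd" would
 * be a single out-of-range escape. Splitting the literal ends the escape
 * after "e9"; the two pieces are concatenated afterwards. */
static const char hex_form[] = "a\xe9" "bcd\n";

/* Illustrative check: both spellings give a, E9, b, c, d, newline, NUL. */
static void check_forms(void)
{
    assert(sizeof octal_form == 7);
    assert(sizeof hex_form == sizeof octal_form);
    assert(memcmp(octal_form, hex_form, sizeof octal_form) == 0);
    assert((unsigned char)octal_form[1] == 0xE9);
}
```

Calling `check_forms()` from a test program should pass silently, which is the point of preferring the octal form: it needs no trick to stop at the intended byte.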
Should be fixed in 86ed7800341e4d29f6a1980657fad21a06e0a511.
I did various tests, and I couldn't find any issue. Thanks.
Run the following command with UTF-8 locales:
(the text is "a" followed by the byte E9, followed by "bcd" and a newline character), then search for b or c. One gets "Pattern not found", while a and d are found. This occurs with less 661 under both Debian and Android/Termux.

Note that there is no such issue with GNU grep, i.e. printf "a\xe9bcd\n" | grep -a b outputs the line.