search fails to find match that is 1 or 2 characters after an invalid UTF-8 sequence

gwsw / less

Less - text pager

http://greenwoodsoftware.com/less

search fails to find match that is 1 or 2 characters after an invalid UTF-8 sequence #542

Open vinc17fr opened 2 weeks ago

vinc17fr commented 2 weeks ago

Run the following command with UTF-8 locales:

printf "a\xe9bcd\n" | less

(The text is "a", followed by the byte E9, followed by "bcd" and a newline character.) Then search for b or c: less reports "Pattern not found", while a and d are found. This occurs with less 661 on both Debian and Android/Termux.

Note that there is no such issue with GNU grep, i.e. printf "a\xe9bcd\n" | grep -a b outputs the line.
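
For anyone who wants to inspect the input outside of less, here is a minimal Python sketch (Python is used only for illustration and is not part of the original report). It shows that the E9 byte is the only invalid part of the line, and that b, c and d are ordinary ASCII bytes that a lenient decoder keeps:

```python
data = b"a\xe9bcd\n"  # the same bytes the printf command produces

# Strict decoding fails: E9 announces a 3-byte sequence,
# but the next byte ("b") is not a continuation byte.
try:
    data.decode("utf-8")
    strict_ok = True
    bad_offset = None
except UnicodeDecodeError as exc:
    strict_ok = False
    bad_offset = exc.start  # offset of the offending byte

# Lenient decoding replaces only the invalid byte with U+FFFD
# and keeps the characters that follow it.
lenient = data.decode("utf-8", errors="replace")

print(strict_ok, bad_offset, repr(lenient))
```

Python's decoder keeps "bcd" intact here, which matches the GNU grep behaviour described above.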

avih commented 2 weeks ago

Cool bug.

I'd guess the issue is that when a UTF-8 sequence is decoded, an unexpected byte value is treated as part of the invalid codepoint encoding, without taking into account that it might still be a valid first (or only) byte of a new codepoint encoding.

I fixed the same issue not long ago in the busybox-w32 UTF-8 decoder.
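
The guess above can be illustrated with a small sketch. This is not the code from less or from busybox-w32; the names `seq_len`, `decode` and `swallow_unexpected` are hypothetical, and the decoder is deliberately simplified (it does not reject overlong forms or surrogates). With `swallow_unexpected=True` it skips the full expected sequence length after a bad lead byte, which is one way the reported symptom could arise; with the default it resumes at the first byte that was not a continuation byte.

```python
REPLACEMENT = 0xFFFD  # U+FFFD, used for any invalid sequence


def seq_len(lead):
    """Expected sequence length for a lead byte, or 0 if it cannot start one."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0


def decode(data, swallow_unexpected=False):
    """Decode bytes to a str, replacing invalid sequences with U+FFFD."""
    out = []
    i = 0
    while i < len(data):
        n = seq_len(data[i])
        if n <= 1:
            # ASCII byte, or a byte that cannot start a sequence.
            out.append(data[i] if n == 1 else REPLACEMENT)
            i += 1
            continue
        cp = data[i] & (0xFF >> (n + 1))  # payload bits of the lead byte
        j = i + 1
        while j < i + n and j < len(data) and (data[j] & 0xC0) == 0x80:
            cp = (cp << 6) | (data[j] & 0x3F)
            j += 1
        if j == i + n:
            out.append(cp)  # complete sequence
            i = j
        else:
            out.append(REPLACEMENT)
            # Hypothesised bug: skip the whole expected length.
            # Fix: resume at j, the first byte that was not a continuation.
            i = min(i + n, len(data)) if swallow_unexpected else j
    return "".join(map(chr, out))


data = b"a\xe9bcd\n"
print(repr(decode(data, swallow_unexpected=True)))  # "b" and "c" are lost
print(repr(decode(data)))                           # "b" and "c" survive
```

In the swallowing variant only "a" and "d" remain searchable, which is consistent with the report; whether less failed for exactly this reason is an assumption.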

avih commented 2 weeks ago

printf "a\xe9bcd\n"

Not a great example.

POSIX doesn't define \x in a shell string, and in a C string a hex escape consumes every hex digit that follows it (even if the result has to fit in a single byte, i.e. a normal non-wide string; in this example that means everything up to the \n). So it's not obvious that the shell would interpret it as \xe9 followed by the rest taken literally. It certainly wouldn't be interpreted that way in C.

A better example would be to use an octal literal, which is specified in POSIX:

printf "a\351bcd\n"

(my commit message above has the same issue, though it's specific to busybox, and busybox-ash does take at most two hex digits after \x)
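
The two readings of \x described above can be sketched as follows. The helper `hex_escape_digits` is hypothetical and only models how many digits an escape consumes; it is not how any particular shell or C compiler is implemented.

```python
import string


def hex_escape_digits(text, max_digits=None):
    """Return the hex digits consumed by the first \\x escape in text.

    max_digits=None models the greedy C rule; max_digits=2 models an
    implementation that takes at most two digits.
    """
    start = text.index("\\x") + 2
    end = start
    while end < len(text) and text[end] in string.hexdigits:
        if max_digits is not None and end - start >= max_digits:
            break
        end += 1
    return text[start:end]


literal = r"a\xe9bcd\n"  # the characters as typed, not yet interpreted

print(hex_escape_digits(literal))     # greedy: e9bcd
print(hex_escape_digits(literal, 2))  # at most two digits: e9
```

Because "b", "c" and "d" are themselves hex digits, the greedy rule absorbs them into the escape, which is why the octal form is the less ambiguous way to write the test input.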

gwsw commented 2 weeks ago

Should be fixed in 86ed7800341e4d29f6a1980657fad21a06e0a511.

vinc17fr commented 2 weeks ago

I did various tests, and I couldn't find any issue. Thanks.