The utf8nvalid procedure fails to respect the n parameter when the string ends in a multibyte codepoint. In those cases, it reads one byte past the end of the n-byte range when checking that the codepoint is terminated; the bounds check covers str[1] but not the later read of str[2]:
/* ensure that there's 2 bytes or more remained */
if (remained < 2) {
return (utf8_int8_t *)str;
}
/* ensure the 1 following byte in this 2-byte
* utf8 codepoint began with 0b10xxxxxx */
if (0x80 != (0xc0 & str[1])) {
return (utf8_int8_t *)str;
}
/* ensure that our utf8 codepoint ended after 2 bytes */
if (0x80 == (0xc0 & str[2])) {
return (utf8_int8_t *)str;
}
This fails in cases such as the following, where the string is not NUL-terminated and the byte immediately after it happens to be a continuation byte (0x80):
#include <assert.h>
#include <string.h>
#include "utf8.h"
int main(int argc, char** argv) {
const char terminated[] = "\xc2\xa3"; // UTF-8 encoding of U+00A3 (pound sign)
size_t terminated_length = strlen(terminated);
const char memory[] = "\xff\xff\xff\xff"
"\xc2\xa3"
"\x80\xff\xff\xff";
const char* unterminated_begin = &memory[4];
const char* unterminated_end = &memory[strlen(memory) - 4];
size_t unterminated_length = unterminated_end - unterminated_begin;
assert(terminated_length == unterminated_length);
assert(strncmp(terminated, unterminated_begin, unterminated_length) == 0);
// The two strings are identical within the bounds that are passed to
// utf8nvalid, so we would expect these two tests to pass.
assert(utf8nvalid(terminated, terminated_length) == NULL);
assert(utf8nvalid(unterminated_begin, unterminated_length) == NULL); // fails!
}
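One possible direction for a fix, as a rough sketch only: read str[2] only when it lies inside the range. The helper two_byte_sequence_ok and the demo function check_report_case below are hypothetical names made up for illustration, not part of utf8.h, and the sketch covers only the 2-byte case, assuming the lead byte has already been classified as 0b110xxxxx. The 3- and 4-byte branches would presumably need the same kind of guard.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of utf8.h). Assumes str[0] has already been
 * identified as the lead byte of a 2-byte sequence. Returns 1 if the sequence
 * is acceptable within `remained` bytes, 0 otherwise. */
static int two_byte_sequence_ok(const char *str, size_t remained) {
  /* both bytes of the sequence must be inside the range */
  if (remained < 2) {
    return 0;
  }
  /* the following byte must look like 0b10xxxxxx */
  if (0x80 != (0xc0 & str[1])) {
    return 0;
  }
  /* only inspect str[2] when it is inside the range; if the range ends
   * here, the codepoint cannot continue any further */
  if (remained > 2 && 0x80 == (0xc0 & str[2])) {
    return 0;
  }
  return 1;
}

/* Hypothetical demo using the same buffer layout as the report above. */
void check_report_case(void) {
  const char memory[] = "\xff\xff\xff\xff"
                        "\xc2\xa3"
                        "\x80\xff\xff\xff";
  const char *unterminated_begin = &memory[4];
  /* with n == 2 the trailing 0x80 is out of range and is never read */
  assert(two_byte_sequence_ok(unterminated_begin, 2) == 1);
  /* with n == 3 the 0x80 is in range, so the stray continuation byte
   * is still rejected */
  assert(two_byte_sequence_ok(unterminated_begin, 3) == 0);
}
```

With a guard like this, the terminated and unterminated strings from the example would be treated identically for n == 2, since nothing beyond the first two bytes is examined.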