`utf8nvalid` reads out of bounds

z-erica commented 2 years ago

The `utf8nvalid` procedure fails to respect the `n` parameter when the string ends in a multibyte codepoint. In those cases, it reads one byte past the end of the range when checking that the codepoint has ended: the bounds check only guarantees that 2 bytes remain, yet the later check reads `str[2]`:

      /* ensure that there's 2 bytes or more remained */
      if (remained < 2) {
        return (utf8_int8_t *)str;
      }

      /* ensure the 1 following byte in this 2-byte
       * utf8 codepoint began with 0b10xxxxxx */
      if (0x80 != (0xc0 & str[1])) {
        return (utf8_int8_t *)str;
      }

      /* ensure that our utf8 codepoint ended after 2 bytes */
      if (0x80 == (0xc0 & str[2])) {
        return (utf8_int8_t *)str;
      }

This fails in cases such as the following, where the string is not NUL-terminated and the byte just past the range happens to be a continuation byte:

#include <assert.h>
#include <string.h>
#include "utf8.h"

int main(int argc, char** argv) {
    const char terminated[] = "\xc2\xa3"; // UTF-8 encoding of U+00A3 (pound sign)
    size_t terminated_length = strlen(terminated);

    const char memory[] = "\xff\xff\xff\xff"
                          "\xc2\xa3"
                          "\x80\xff\xff\xff";

    const char* unterminated_begin = &memory[4];
    const char* unterminated_end = &memory[strlen(memory) - 4];
    size_t unterminated_length = unterminated_end - unterminated_begin;

    assert(terminated_length == unterminated_length);
    assert(strncmp(terminated, unterminated_begin, unterminated_length) == 0);
    // The two strings are identical within the bounds that are passed to
    // utf8nvalid, so we would expect these two tests to pass.
    assert(utf8nvalid(terminated, terminated_length) == NULL);
    assert(utf8nvalid(unterminated_begin, unterminated_length) == NULL); // fails!
}
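
For illustration, here is a minimal sketch of how the 2-byte check could avoid the read, by only inspecting `str[2]` when that byte lies inside the range. The helper `two_byte_ok` is hypothetical and not part of utf8.h, and this is not necessarily how the library itself should fix it:

```c
#include <stddef.h>

/* Hypothetical helper, not part of utf8.h. Returns 1 when the `remained`
 * bytes at `str` begin with a well-formed 2-byte sequence that is not
 * followed, within the range, by a stray continuation byte. */
static int two_byte_ok(const char *str, size_t remained) {
  /* ensure that there are 2 bytes or more remaining */
  if (remained < 2) {
    return 0;
  }

  /* the lead byte must look like 0b110xxxxx */
  if (0xc0 != (0xe0 & (unsigned char)str[0])) {
    return 0;
  }

  /* the following byte must look like 0b10xxxxxx */
  if (0x80 != (0xc0 & (unsigned char)str[1])) {
    return 0;
  }

  /* only look at the byte after the codepoint if it is within bounds */
  if (remained > 2 && 0x80 == (0xc0 & (unsigned char)str[2])) {
    return 0;
  }

  return 1;
}
```

With the `memory` buffer above, `two_byte_ok(&memory[4], 2)` never touches `memory[6]`, while a length of 3 would still reject the trailing `\x80`.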

sheredom commented 2 years ago

Could you try https://github.com/sheredom/utf8.h/pull/103 please?

z-erica commented 2 years ago

Seems like it runs fine now! Thank you very much.

sheredom / utf8.h

`utf8nvalid` reads out of bounds #102