utf8ncpy incorrectly removes last valid codepoint

curoles commented 1 year ago

In the code:

utf8_int8_t *utf8ncpy(utf8_int8_t *utf8_restrict dst,
                      const utf8_int8_t *utf8_restrict src, size_t n) {
 ...
  for (check_index = index - 1;
       check_index > 0 && 0x80 == (0xc0 & d[check_index]); check_index--) {
    /* just moving the index */
  }

For code points >7F 0xc0 is valid mask for 1st byte, for rest it is 0x80 (https://en.wikipedia.org/wiki/UTF-8).

Consider string °¯\_(ツ)_/¯°

let's add a printout:

     /* just moving the index */
      printf("index:%lu byte:%x cond:%u\n", check_index, (unsigned)(unsigned char)d[check_index],
              0x80 == (0xc0 & d[check_index]));

copy index:0 byte:c2
copy index:1 byte:b0
copy index:2 byte:c2
copy index:3 byte:af
copy index:4 byte:5c
copy index:5 byte:5f
copy index:6 byte:28
copy index:7 byte:e3
copy index:8 byte:83
copy index:9 byte:84
copy index:10 byte:29
copy index:11 byte:5f
copy index:12 byte:2f
copy index:13 byte:c2
copy index:14 byte:af
copy index:15 byte:c2
copy index:16 byte:b0
copy index:17 byte:0
found null
index:16 byte:b0 cond:1
index:15 byte:c2 cond:0

Following after that code will chop last valid code point:

  if (check_index < index &&
      (index - check_index) < utf8codepointsize(d[check_index])) { //(17-15)=2 < utf8codepointsize=4
    index = check_index;
  }

FIX

Fix that worked for me: (index - check_index) < utf8codepointcalcsize(&d[check_index]))

The problem with using utf8codepointsize:

utf8_constexpr14_impl size_t utf8codepointsize(utf8_int32_t chr) {
  if (0 == ((utf8_int32_t)0xffffff80 & chr)) {
    return 1;
  } else if (0 == ((utf8_int32_t)0xfffff800 & chr)) {
    return 2;
  } else if (0 == ((utf8_int32_t)0xffff0000 & chr)) {
    return 3;
  } else { /* if (0 == ((int)0xffe00000 & chr)) { */
    return 4;
  }
}

is that c2 becomes ffffffc2 and none of the 0xffffxxxx & chr == 0

sheredom commented 1 year ago

Happy to accept a PR for this if you are willing!

curoles commented 1 year ago

happy to create PR https://github.com/sheredom/utf8.h/pull/110

richgel999 commented 11 months ago

This is a scary bug - is it going to be merged in?

sheredom commented 11 months ago

The linked PR fails CI tests and @curoles said they would look at the results. There hasn't been any follow-up from there.

curoles commented 11 months ago

Sorry, I got distracted and forgot to leave a comment. See https://github.com/sheredom/utf8.h/pull/110. Test UTEST(utf8ncpy, truncated_copy_null_terminated) fails. Maybe someone with better understanding of what this test is expecting should have a look.

curoles commented 11 months ago

I believe I fixed one more issue when destination string it too small to store codepoint plus null terminator. All tests are passing now, nevertheless, check if my last fix makes sense.

sheredom commented 11 months ago

Good good! Thanks for the fix, I always appreciate it when people chip in and make the project better!

sheredom / utf8.h

utf8ncpy incorrectly removes last valid codepoint #109

FIX