utf8rchr issue - Githubissues

coemge commented 8 months ago

Hello, I think I have found a bug in the utf8rchr code. On some occasions it skips past the null terminator of the string and continues reading until it finds the specified character. Running the following code under gcc produced the problem:

#include <stdio.h>
#include "utf8.h"

int main()
{
    char *s1 = "Hello";
    char *s2 = "Hello  ";
    char *result = utf8rchr(s1, 'o');

    printf("String pointer: %llx\n", s1);
    printf("Char pointer  : %llx\n", result);
    printf("Index         : %d\n", result-s1);

    return 0;
}

This code produced the following output, indicating that the last occurrence of 'o' was at index 10 (it should be at 4), well past the end of the string.

Output from gcc:

String pointer: 7ff659d368d4
Char pointer  : 7ff659d368de
Index         : 10

Looking at the code for utf8rchr, I believe the problem code is where offset is being incremented by 2 (for a single byte ascii character) instead of 1, and skipping the Null terminator:

    while (src[offset] == c[offset]) {
      offset++;
    }

This doesn't occur on all occasions. I suspect that if the code encounters multiple Null characters before it finds another occurrence of the search character, it works okay.

sheredom commented 8 months ago

Good find! PR in with a fix.

coemge commented 8 months ago

Thanks for the quick update, but now, when there are multiple occurrences of the character in the string, strrchr is returning a pointer to the second-last occurrence, not the last. The following code and output shows the problem.

#include <stdio.h>
#include "utf8.h"

int main()
{
    char *s1 = "Helloo";
    char *s2 = "Heloo";
    char *s3 = "Helll";
    char *result;

    result = utf8rchr(s1, 'o');
    printf("String pointer: %llx\n", s1);
    printf("Char pointer  : %llx\n", result);
    printf("Index         : %d\n\n", result-s1);

    result = utf8rchr(s2, 'o');
    printf("String pointer: %llx\n", s2);
    printf("Char pointer  : %llx\n", result);
    printf("Index         : %d\n\n", result-s2);

    result = utf8rchr(s3, 'l');
    printf("String pointer: %llx\n", s3);
    printf("Char pointer  : %llx\n", result);
    printf("Index         : %d\n\n", result-s3);

    return 0;
}

OUTPUT
---------
String pointer: 7ff7a59b78d4
Char pointer  : 7ff7a59b78d8
Index         : 4           // this should be 5

String pointer: 7ff7a59b78db
Char pointer  : 7ff7a59b78de
Index         : 3           // this should be 4

String pointer: 7ff7a59b78e1
Char pointer  : 7ff7a59b78e4
Index         : 3           // this should be 4

It is working fine when there is only one occurrence of the character in the string.

I think the problem is in the following code, where src is being incremented by offset, and then src[offset] is used to test for discontinuation. I think that the test for discontinuation needs to occur before src is incremented. I am going to test this now.

    if ('\0' == c[offset]) {
      /* we found a matching utf8 code point */
      match = (utf8_int8_t *)src;
      src += offset;

      if ('\0' == src[offset]) {
        break;
      }

I have just run my original code again with the following change, along with some other tests, and it appears to be working fine. I have also run this code with multiple 2,3 and 4 byte characters and it appears to work.

    if ('\0' == c[offset]) {
      /* we found a matching utf8 code point */
      match = (utf8_int8_t *)src;

      if ('\0' == src[offset]) {
        break;
      }
      src += offset;    /* this line moved to after discontinuation test above */

    } else {

sheredom commented 8 months ago

That was dumb of me - fixed in https://github.com/sheredom/utf8.h/pull/116

sheredom / utf8.h

utf8rchr issue #114