utf8tok and utf8tok_r

warmwaffles commented 3 years ago

I've been playing with adding utf8tok but the problem with the original implementation is that it is not re-entrant.

I've been looking at how musl implemented strtok_r, and it's relatively simple. This is what I have for utf8tok_r:

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s += utf8spn(s, sep);
  if (!*s) {
    return *p = 0;
  }

  *p = s + utf8cspn(s, sep);
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}

The following is the test I implemented (it fails at the assert for föőf):

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäáé|föőf|that|");
    char* ptr = NULL;

    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "aäáé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "that", 4));

    free(string);
}

After playing with this for a bit, I am kind of at a loss for what to do.

Anyways, leaving this here in case someone else wants to pick it up and go on.

gulrak commented 2 years ago

I stumbled across this and had a look. The problem is that while utf8spn/utf8cspn are logically equivalent to strspn/strcspn, their results count codepoints, not bytes, so they cannot simply be added to char pointers. Besides that, the API of strtok_r requires the first parameter to be NULL on subsequent calls. Combining the two, and adding another helper (I didn't find anything matching in utf8.h), I came up with:

void *utf8incr(void *utf8_restrict str, size_t len) {
    char* s = (char*) str;
    while(*s && len--) {
        size_t l = utf8codepointcalcsize(s);
        while(*s && l--) ++s;
    }
    return s;
}

void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
  char* s = (char*) str;
  char** p = (char**) ptr;

  if (!s && !(s = *p)) {
    return NULL;
  }

  s = utf8incr(s, utf8spn(s, sep));
  if (!*s) {
    return *p = 0;
  }

  *p = utf8incr(s, utf8cspn(s, sep));
  if (**p) {
    *(*p)++ = 0;
  } else {
    *p = 0;
  }

  return s;
}
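To illustrate the codepoint-versus-byte mismatch on its own, here is a small standalone sketch. It does not use utf8.h; both function names are hypothetical and exist only for this example, and the input is assumed to be valid UTF-8.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper, not part of utf8.h: counts the codepoints before the
   first '|' in a valid UTF-8 string. Continuation bytes have the bit pattern
   10xxxxxx, so every other byte starts a new codepoint. */
size_t codepoints_before_bar(const char *s) {
    size_t n = 0;
    assert(s != NULL);
    for (; *s && *s != '|'; ++s) {
        if (((unsigned char)*s & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}

/* Hypothetical helper: byte distance to the first '|', which is the offset
   that pointer arithmetic on a char* actually needs. */
size_t bytes_before_bar(const char *s) {
    assert(s != NULL);
    return strcspn(s, "|");
}
```

For the token aäáé followed by a separator, the first function gives 4 and the second gives 7, because ä, á and é each take two bytes. Adding 4 to the char pointer therefore lands inside the token instead of on the separator, which is why a byte-stepping helper like utf8incr is needed.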

And as a small change to the test:

UTEST(utf8tok_r, token_walking) {
    char* string = utf8dup("this|aäáé|föőf|that|");
    char* ptr = NULL;

    /* Pass the buffer on the first call only; NULL continues from ptr.
       Keeping string intact lets free() release the allocation. */
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "aäáé", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "föőf", 4));
    ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "that", 4));

    free(string);
}
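For comparison, the same calling convention with the POSIX strtok_r looks roughly like this. It is a sketch only: split_tokens is a hypothetical name, and it works on plain bytes, so it says nothing about multi-byte separators.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: splits buf in place and stores up to max token
   pointers in out. The buffer is passed on the first call only; every
   later call passes NULL and continues from the saved position. */
size_t split_tokens(char *buf, const char *sep, char **out, size_t max) {
    char *save = NULL;
    size_t n = 0;
    char *tok;

    assert(buf != NULL && sep != NULL && out != NULL);

    tok = strtok_r(buf, sep, &save);
    while (tok != NULL && n < max) {
        out[n++] = tok;
        tok = strtok_r(NULL, sep, &save);
    }
    return n;
}
```

Passing the original buffer again on a later call would restart tokenizing from the beginning, which is what the first version of the test did.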

warmwaffles commented 2 years ago

@sheredom this is a pretty interesting find

sheredom / utf8.h

utf8tok and utf8tok_r #89