Open warmwaffles opened 3 years ago
I stumbled across this and had a look. The problem is that while utf8spn/utf8cspn are logically equivalent to strspn/strcspn, their results are counted in codepoints, not bytes, so they cannot simply be added to char pointers. Besides that, the API of strtok_r requires the first parameter to be NULL on subsequent calls. Combining this and adding another helper (I didn't find anything matching in utf8.h), I came up with:
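    /* Advance str by len codepoints (not bytes), stopping at the terminator. */
    void *utf8incr(void *utf8_restrict str, size_t len) {
        char *s = (char *)str;
        while (*s && len--) {
            /* Byte length of the codepoint that starts at s. */
            size_t l = utf8codepointcalcsize(s);
            while (*s && l--) ++s;
        }
        return s;
    }

    void *utf8tok_r(void *utf8_restrict str, const void *utf8_restrict sep, void **utf8_restrict ptr) {
        char *s = (char *)str;
        char **p = (char **)ptr;
        /* On subsequent calls str is NULL, so resume from the saved position. */
        if (!s && !(s = *p)) {
            return NULL;
        }
        /* Skip leading separators. */
        s = (char *)utf8incr(s, utf8spn(s, sep));
        if (!*s) {
            return *p = 0;
        }
        /* Find the end of the token. */
        *p = (char *)utf8incr(s, utf8cspn(s, sep));
        if (**p) {
            /* Terminate the token and step over the whole separator codepoint,
               which may be more than one byte long. */
            size_t l = utf8codepointcalcsize(*p);
            **p = 0;
            *p += l;
        } else {
            *p = 0;
        }
        return s;
    }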
And as a small change to the test:
    UTEST(utf8tok_r, token_walking) {
        char *string = (char *)utf8dup("this|aäáé|föőf|that|");
        void *ptr = NULL;
        /* Pass the string on the first call and NULL on every later call. */
        ASSERT_EQ(0, utf8ncmp(utf8tok_r(string, "|", &ptr), "this", 4));
        ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "aäáé", 4));
        ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "föőf", 4));
        ASSERT_EQ(0, utf8ncmp(utf8tok_r(NULL, "|", &ptr), "that", 4));
        /* string still points at the original allocation, so this frees it. */
        free(string);
    }
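For anyone following along, the codepoint-versus-byte mismatch described above can be seen in isolation. This is only a minimal sketch in plain C that does not use utf8.h: cp_len and cp_advance are hypothetical helpers made up for illustration, and they assume well-formed UTF-8 input.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: byte length of the UTF-8 sequence whose lead byte
   is c. Assumes well-formed input; anything unexpected counts as 1 byte. */
static size_t cp_len(unsigned char c) {
    if ((c & 0x80) == 0x00) return 1; /* 0xxxxxxx */
    if ((c & 0xE0) == 0xC0) return 2; /* 110xxxxx */
    if ((c & 0xF0) == 0xE0) return 3; /* 1110xxxx */
    if ((c & 0xF8) == 0xF0) return 4; /* 11110xxx */
    return 1;
}

/* Hypothetical helper: skip n codepoints, stopping early at the terminator. */
static const char *cp_advance(const char *s, size_t n) {
    assert(s != NULL);
    while (*s && n--) {
        size_t l = cp_len((unsigned char)*s);
        while (*s && l--) ++s;
    }
    return s;
}
```

With the input "a\xC3\xA4|b" (that is, "aä|b"), the separator sits 2 codepoints in but 3 bytes in. So s + 2 points into the middle of ä, while cp_advance(s, 2) points at the '|'.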
@sheredom this is a pretty interesting find.

I've been playing with adding utf8tok, but the problem with the original implementation is that it is not re-entrant. I've been looking at how musl implemented utf8tok_r, and it's relatively simple (here).

The following is the implemented test (it fails at the assert for föőf).

After playing with this for a bit, I am kind of at a loss for what to do. Anyways, leaving this here in case someone else wants to pick it up and go on.
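
On the re-entrancy point: the reason a non-reentrant tokenizer is a problem shows up as soon as two tokenizations are interleaved. Below is a minimal sketch using the POSIX strtok_r rather than anything from utf8.h. count_fields is a hypothetical name, and the ';' and ',' separators are just an assumed input format.

```c
#define _POSIX_C_SOURCE 200809L /* makes strtok_r visible under strict -std */
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Counts the fields in a buffer such as "a,b;c,d,e", where records are
   separated by ';' and fields by ','. The buffer is modified in place. */
static size_t count_fields(char *buf) {
    char *outer_state = NULL; /* resume position for the record loop */
    char *inner_state = NULL; /* resume position for the field loop */
    size_t fields = 0;
    char *rec;
    char *f;

    assert(buf != NULL);
    for (rec = strtok_r(buf, ";", &outer_state); rec != NULL;
         rec = strtok_r(NULL, ";", &outer_state)) {
        for (f = strtok_r(rec, ",", &inner_state); f != NULL;
             f = strtok_r(NULL, ",", &inner_state)) {
            ++fields;
        }
    }
    return fields;
}
```

Each loop keeps its own state pointer, so the inner tokenization does not disturb the outer one. With plain strtok both loops would share one hidden state, and the outer loop would typically stop after the first record.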