Closed Nairou closed 3 years ago
That sounds like a reasonable idea - it's likely just an oversight when I wrote the code.
Fancy doing a PR? Otherwise I can add it to the growing worklist!
I've been thinking about ways this could be implemented. Tracking codepoint position during the copy would probably prevent the compiler from optimizing the copy into a memcpy.
The other option I see is to do an automatic call to utf8valid()
after the copy, to insert a null terminator at the invalid point. Or, walk through the original string with utf8codepointsize()
until we hit the destination size limit, so we know how much to copy. Either way requires iterating over the entire string a second time.
Do you see any alternative?
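For concreteness, here's a minimal sketch of the second option above (pre-walking the source by codepoint size to decide how much to copy). This is not utf8.h's actual code; utf8_lead_size and utf8ncpy_prewalk are hypothetical stand-ins, with the lead-byte logic playing the role of utf8codepointsize():

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in for utf8codepointsize(): codepoint byte length
 * from its UTF-8 lead byte. */
static size_t utf8_lead_size(unsigned char lead) {
  if (lead < 0x80) return 1;           /* 0xxxxxxx: ASCII */
  if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
  return 1;                            /* invalid lead; treat as one byte */
}

/* Sketch of the "pre-walk" option: measure each codepoint before
 * copying it, stopping while there is still room for the codepoint
 * plus the null terminator in an n-byte destination buffer. */
static void utf8ncpy_prewalk(char *dst, const char *src, size_t n) {
  size_t used = 0;
  while (src[used] != '\0') {
    size_t cp = utf8_lead_size((unsigned char)src[used]);
    if (used + cp + 1 > n) /* +1 reserves room for the null terminator */
      break;
    used += cp;
  }
  memcpy(dst, src, used);
  dst[used] = '\0';
}
```

As noted above, this walks the source a second time (once to measure, once in memcpy), which is the cost being weighed against the post-copy fixup approach.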
My guess would be (and it's just a guess!) that the best way would be to insert a new check in https://github.com/sheredom/utf8.h/blob/master/utf8.h#L585 (between the copy from src -> dst and the null byte appending) to verify that the last bytes inserted form a full codepoint: look back through the bytes until you find one that doesn't start with 0b10xxxxxx, then make sure you had enough bytes for the size of that codepoint.
That way the main loop should still be optimal, and there's only a small amount of extra work afterwards?
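A rough sketch of that post-copy check, assuming the fast byte-wise copy has already run and written `written` bytes (utf8_trim_partial is a hypothetical name, not part of utf8.h):

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical post-copy fixup: scan backwards from the end of the
 * null-terminated dst for the last lead byte (anything not matching
 * 0b10xxxxxx), and if that codepoint didn't fit completely, truncate
 * the string at the lead byte. */
static void utf8_trim_partial(char *dst) {
  size_t len = strlen(dst);
  size_t i = len;
  /* walk back over continuation bytes (0b10xxxxxx) */
  while (i > 0 && ((unsigned char)dst[i - 1] & 0xC0) == 0x80)
    i--;
  if (i == 0) { /* nothing but continuation bytes: no valid content */
    dst[0] = '\0';
    return;
  }
  unsigned char lead = (unsigned char)dst[i - 1];
  size_t need = 1;
  if ((lead & 0xE0) == 0xC0) need = 2;
  else if ((lead & 0xF0) == 0xE0) need = 3;
  else if ((lead & 0xF8) == 0xF0) need = 4;
  size_t have = len - (i - 1); /* bytes present for the last codepoint */
  if (have < need)
    dst[i - 1] = '\0'; /* partial codepoint: cut it off */
}
```

Since UTF-8 codepoints are at most 4 bytes, this look-back is bounded at a few bytes regardless of string length, so the main copy loop stays untouched.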
It looks like utf8cpy will copy the entire string, but makes an assumption about the destination being big enough, whereas utf8ncpy allows you to specify a destination buffer size limit, but risks creating an invalid result if the source string is longer. I'm curious when this second result is ever desirable? If I'm working with utf8 strings, and I want to limit a string to a certain buffer size, shouldn't it crop the string at a valid code point?
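To illustrate the problem: "é" is two bytes in UTF-8 (0xC3 0xA9), and any purely byte-limited copy can cut that sequence in half, leaving a lone lead byte at the end of the destination. This sketch uses plain strncpy as a stand-in for the byte-limited behavior described (byte_limited_copy is a hypothetical helper, not a utf8.h function):

```c
#include <string.h>

/* Stand-in for a byte-limited copy into an n-byte buffer: copies at
 * most n-1 bytes and always null-terminates, with no awareness of
 * UTF-8 codepoint boundaries. */
static void byte_limited_copy(char *dst, const char *src, size_t n) {
  strncpy(dst, src, n - 1);
  dst[n - 1] = '\0';
}
```

Copying "café" (5 bytes: 'c', 'a', 'f', 0xC3, 0xA9) into a 5-byte buffer this way keeps only the 0xC3 lead byte of "é", so the result is not valid UTF-8 even though every source codepoint was valid.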