sheredom / utf8.h

📚 single header utf8 string functions for C and C++
The Unlicense

Copy string to limited buffer, without risking invalid result? #77

Closed Nairou closed 3 years ago

Nairou commented 3 years ago

It looks like utf8cpy will copy the entire string, but makes an assumption about the destination being big enough, whereas utf8ncpy allows you to specify a destination buffer size limit, but risks creating an invalid result if the source string is longer.

I'm curious when this second result is ever desirable? If I'm working with utf8 strings, and I want to limit a string to a certain buffer size, shouldn't it crop the string at a valid code point?

sheredom commented 3 years ago

That sounds like a reasonable idea - it's likely just an oversight when I wrote the code.

Fancy doing a PR? Otherwise I can add it to the growing worklist!

Nairou commented 3 years ago

I've been thinking about ways this could be implemented. Tracking codepoint position during the copy would probably prevent the compiler from optimizing the copy into a memcpy.

The other option I see is to do an automatic call to utf8valid() after the copy, to insert a null terminator at the invalid point. Or, walk through the original string with utf8codepointsize() until we hit the destination size limit, so we know how much to copy. Either way requires iterating over the entire string a second time.
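As a rough illustration of the second option (walking the source by code point size before copying), here is a minimal sketch. It does not use utf8.h: `sketch_seq_len` and `sketch_copy_prefix` are hypothetical names, `dst_size` is assumed to include room for the null terminator (which is not how utf8ncpy treats its size argument), and `src` is assumed to be valid, null-terminated UTF-8.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: byte length of a UTF-8 sequence, judged from its
   lead byte only. */
static size_t sketch_seq_len(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
  if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
  return 1; /* stray continuation or invalid byte: step over one byte */
}

/* Hypothetical copy: writes at most dst_size - 1 bytes of src plus a null
   terminator, stopping before any code point that would not fit whole.
   Assumes src is valid UTF-8. Returns the number of bytes copied,
   excluding the terminator. */
static size_t sketch_copy_prefix(char *dst, size_t dst_size, const char *src) {
  size_t len = 0;

  if (dst_size == 0) return 0;

  /* First pass: find the longest prefix that ends on a code point boundary. */
  while (src[len] != '\0') {
    size_t step = sketch_seq_len((unsigned char)src[len]);
    if (len + step > dst_size - 1) break;
    len += step;
  }

  /* Second pass: a plain block copy of the bytes that fit. */
  memcpy(dst, src, len);
  dst[len] = '\0';
  return len;
}
```

The cost mentioned above is visible here: the bytes that end up copied are visited twice, once to find the boundary and once by `memcpy`, although the first pass stops at the size limit rather than at the end of the source.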

Do you see any alternative?

sheredom commented 3 years ago

My guess would be (and it's just a guess!) that the best way would be to insert a new check in https://github.com/sheredom/utf8.h/blob/master/utf8.h#L585 (between the copy from src -> dst and the null byte appending) to verify that the last bytes copied form a full codepoint: look back through the bytes until you find one that doesn't match 0b10xxxxxx, and then make sure that you had enough bytes for the size of that codepoint?

That way the main loop should still be optimal, and you'll only have a small amount of extra work afterwards?
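The look-back idea could be sketched roughly as follows. This is not the code in utf8.h: `sketch_lead_len`, `sketch_trim_partial_tail` and `sketch_bounded_copy` are hypothetical names, `dst_size` is assumed to include the null terminator, and `src` is assumed to be valid UTF-8, so at most three continuation bytes are ever stepped over.

```c
#include <stddef.h>

/* Hypothetical helper: expected sequence length for a UTF-8 lead byte. */
static size_t sketch_lead_len(unsigned char lead) {
  if ((lead & 0x80) == 0x00) return 1; /* 0xxxxxxx */
  if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
  return 1; /* invalid lead byte: treat as a single byte */
}

/* Given len bytes already copied into buf, return the length to keep so
   that the buffer does not end in the middle of a code point. */
static size_t sketch_trim_partial_tail(const char *buf, size_t len) {
  size_t start = len;

  /* Look back over continuation bytes (10xxxxxx). */
  while (start > 0 && ((unsigned char)buf[start - 1] & 0xC0) == 0x80) {
    start--;
  }
  if (start == 0) return len; /* empty, or no lead byte found: leave as-is */

  start--; /* index of the last lead byte */
  if (len - start < sketch_lead_len((unsigned char)buf[start])) {
    return start; /* last code point is incomplete: drop it */
  }
  return len;
}

/* Hypothetical bounded copy: a plain byte loop, then the look-back check,
   then the null terminator. Returns bytes kept, excluding the terminator. */
static size_t sketch_bounded_copy(char *dst, size_t dst_size, const char *src) {
  size_t n = 0;

  if (dst_size == 0) return 0;

  while (n + 1 < dst_size && src[n] != '\0') {
    dst[n] = src[n];
    n++;
  }
  n = sketch_trim_partial_tail(dst, n);
  dst[n] = '\0';
  return n;
}
```

The copy loop itself stays a straight byte copy, and the extra check only inspects the last few bytes of the destination, so the source is not scanned a second time.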