Closed by PossiblyAShrub 1 month ago
Thinking about it a little more, I think the caller is responsible for not calling utf8_decode() on the trailing NUL, in the valid case
Because you don't want to get UTF8_OK in that case
But in the invalid case, the NUL that the caller supplied IS read, and that's OK, and it's necessary to return UTF8_TRUNCATED_BYTES
In other words, I think we actually don't need UTF8_END_OF_STREAM at all? I think we can just get rid of it, and make the caller responsible
It is a bit weird and subtle, but I think it makes sense
(and this issue is why I was initially confused about the whole state machine / "inverting" the Crockford code)
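To make the caller's side of this contract concrete, here is a minimal sketch in C. It is not the utf8_decode() from this PR: sk_decode_one(), sk_count_codepoints(), SkResult and the SK_* error names are hypothetical, and the decoder is simplified (no overlong, surrogate, or range checks). It only illustrates the rule discussed above: the caller tracks the buffer end and never decodes the trailing NUL, while the decoder may still read that NUL in the middle of a sequence and report truncation.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical error codes for this sketch; note there is no END_OF_STREAM. */
typedef enum { SK_OK, SK_BAD_ENCODING, SK_TRUNCATED_BYTES } SkError;

typedef struct {
  SkError error;
  uint32_t codepoint;
  size_t bytes_read; /* bytes consumed, not counting a NUL sentinel */
} SkResult;

/* Decode one code point starting at p. Takes no length: it relies on a NUL
 * byte sitting right after the last byte of the buffer.
 * Simplified: overlong forms, surrogates and values > U+10FFFF are accepted. */
static SkResult sk_decode_one(const unsigned char *p) {
  SkResult r = {SK_OK, 0, 1};
  unsigned char first = p[0];
  int continuation;

  if (first < 0x80) { /* ASCII, including an embedded NUL */
    r.codepoint = first;
    return r;
  } else if ((first & 0xE0) == 0xC0) {
    continuation = 1;
    r.codepoint = first & 0x1F;
  } else if ((first & 0xF0) == 0xE0) {
    continuation = 2;
    r.codepoint = first & 0x0F;
  } else if ((first & 0xF8) == 0xF0) {
    continuation = 3;
    r.codepoint = first & 0x07;
  } else {
    r.error = SK_BAD_ENCODING; /* stray continuation or invalid lead byte */
    return r;
  }

  for (int i = 1; i <= continuation; i++) {
    unsigned char b = p[i];
    if (b == 0) { /* the NUL IS read here, and that is how truncation shows */
      r.error = SK_TRUNCATED_BYTES;
      r.bytes_read = (size_t)i;
      return r;
    }
    if ((b & 0xC0) != 0x80) {
      r.error = SK_BAD_ENCODING;
      r.bytes_read = (size_t)i;
      return r;
    }
    r.codepoint = (r.codepoint << 6) | (uint32_t)(b & 0x3F);
    r.bytes_read = (size_t)i + 1;
  }
  return r;
}

/* Count code points in buf[0..len). Requires buf[len] == 0.
 * Returns -1 and sets *err on the first decoding error. */
static long sk_count_codepoints(const unsigned char *buf, size_t len,
                                SkError *err) {
  long n = 0;
  size_t pos = 0;
  *err = SK_OK;
  while (pos < len) { /* the caller stops before the trailing NUL */
    SkResult r = sk_decode_one(buf + pos);
    if (r.error != SK_OK) {
      *err = r.error;
      return -1;
    }
    pos += r.bytes_read;
    n++;
  }
  return n;
}
```

In this sketch, calling sk_decode_one() directly on the trailing NUL would return SK_OK with code point 0, which is exactly the result the caller does not want; the `pos < len` check is what prevents it.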
Yeah, you were right about removing the END_OF_STREAM error state; it simplified the code while preserving correctness. The "nul-terminator required but you must keep track of the buffer end" rule is certainly subtle, so I made sure to note it in the doc-comment.
This is ready for another review.
Looks very nice now, thanks!
(thought)
One way to think about this is that utf8_decode() does NOT take a NUL terminated string
It is more like an "unsafe transducer" that takes a pointer, sorta like J8EncodeOne() and ShellEncodeOne()
It does "one" thing which happens to involve a variable number of bytes processed
Per the discussion in #1965, I've made the following updates:
- fastfunc.Utf8DecodeOne: "safe" by exposing the python2/oils string model (over the C one)
- osh/string_ops.py: UTF-8 functions (fastfunc.Utf8DecodeOne now handles that)
- Str => trim*() are resilient to zero-codepoints