[fastfunc] Use the python2/oils string model in Utf8DecodeOne

PossiblyAShrub commented 1 month ago

Per the discussion in #1965, I've made the following updates:

Made fastfunc.Utf8DecodeOne "safe" by exposing the python2/oils string model (rather than the C one)
Removed the NUL special-casing in our osh/string_ops.py UTF-8 functions (fastfunc.Utf8DecodeOne now handles that)
Added some tests to validate that string APIs like Str => trim*() are resilient to NUL (zero) code points

andychu commented 1 month ago

Thinking about it a little more, I think the caller is responsible for not calling utf8_decode() on the trailing NUL, in the valid case

Because you don't want to get UTF8_OK in that case

But in the invalid case, the NUL that the caller supplied IS read, and that's OK, and it's necessary to return UTF8_TRUNCATED_BYTES

In other words, I think we actually don't need UTF8_END_OF_STREAM at all? I think we can just get rid of it, and make the caller responsible for:

NUL terminating every string
not over-running the buffer -- i.e. don't call it on the NUL past the end of the string, only on bytes before the end of the string (which the caller knows)

It is a bit weird and subtle, but I think it makes sense

(and this issue is why I was initially confused about the whole state machine / "inverting" the Crockford code)

PossiblyAShrub commented 1 month ago

Yeah, you were right about removing the END_OF_STREAM error state; it simplified the code while preserving correctness. The "nul-terminator required but you must keep track of the buffer end" rule is certainly subtle, so I made sure to note it in the doc-comment.

This is ready for another review.

andychu commented 1 month ago

Looks very nice now, thanks!

andychu commented 1 month ago

(thought)

One way to think about this is that utf8_decode() does NOT take a NUL terminated string

It is more like an "unsafe transducer" that takes a pointer, sorta like J8EncodeOne() and ShellEncodeOne()

It does "one" thing which happens to involve a variable number of bytes processed
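
The caller contract discussed in this thread can be sketched roughly as follows. This is a minimal Python 3 model, not the actual utf8_decode() or fastfunc.Utf8DecodeOne code: decode_one(), decode_all(), the (status, code point, bytes consumed) return shape, and the UTF8_BAD_ENCODING name are assumptions made for illustration, and overlong, surrogate, and range checks are left out. It also assumes that a NUL read in place of a continuation byte is reported as UTF8_TRUNCATED_BYTES.

```python
UTF8_OK = "UTF8_OK"
UTF8_TRUNCATED_BYTES = "UTF8_TRUNCATED_BYTES"
UTF8_BAD_ENCODING = "UTF8_BAD_ENCODING"  # hypothetical name, not from the thread


def decode_one(buf, pos):
    """Decode one code point starting at buf[pos].

    Hypothetical stand-in for utf8_decode(): it gets a "pointer" (buf, pos)
    and no length. It relies on the caller having appended a NUL.
    Returns (status, code_point, bytes_consumed).
    """
    first = buf[pos]
    if first < 0x80:
        # ASCII, including an embedded NUL, which is a valid code point here.
        return UTF8_OK, first, 1
    if 0xC0 <= first < 0xE0:
        need, cp = 1, first & 0x1F
    elif 0xE0 <= first < 0xF0:
        need, cp = 2, first & 0x0F
    elif 0xF0 <= first < 0xF8:
        need, cp = 3, first & 0x07
    else:
        # Stray continuation byte or invalid lead byte.
        return UTF8_BAD_ENCODING, -1, 1

    for k in range(1, need + 1):
        b = buf[pos + k]
        if b == 0:
            # Assumption: a NUL where a continuation byte should be means
            # the sequence was cut off. The caller's terminator IS read here.
            return UTF8_TRUNCATED_BYTES, -1, k
        if (b & 0xC0) != 0x80:
            return UTF8_BAD_ENCODING, -1, k
        cp = (cp << 6) | (b & 0x3F)
    return UTF8_OK, cp, need + 1


def decode_all(data):
    """Caller side: NUL-terminate, track the end, never decode at the end."""
    buf = data + b"\0"   # responsibility 1: NUL-terminate every string
    end = len(data)      # the caller knows where the string really ends
    out = []
    pos = 0
    while pos < end:     # responsibility 2: never call on the trailing NUL
        status, cp, n = decode_one(buf, pos)
        out.append((status, cp))
        pos += n
    return out
```

With this split, decode_all(b"a\x00b") yields three UTF8_OK results (the embedded NUL decodes as code point 0), while decode_all(b"\xe2\x82") yields a single UTF8_TRUNCATED_BYTES because the decoder runs into the terminator. No end-of-stream status is needed, since the loop condition already covers that case.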

oilshell / oil

[fastfunc] Use the python2/oils string model in Utf8DecodeOne #1967