oilshell / oil

Oils is our upgrade path from bash to a better language and runtime. It's also for Python and JavaScript users who avoid shell!
http://www.oilshell.org/
Other
2.78k stars 150 forks source link

[fastfunc] Use the python2/oils string model in Utf8DecodeOne #1967

Closed PossiblyAShrub closed 1 month ago

PossiblyAShrub commented 1 month ago

Per the discussion in #1965, I've made the following updates:

andychu commented 1 month ago

Thinking about it a little more, I think the caller is responsible for not calling utf8_decode() on the trailing NUL, in the valid case

Because you don't want to get UTF8_OK in that case

But in the invalid case, the NUL that the caller supplied IS read, and that's OK, and it's necessary to return UTF8_TRUNCATED_BYTES


In other words, I think we actually don't need UTF8_END_OF_STREAM at all? I think we can just get rid of it, and make the caller is responsible

  1. NUL terminating every string
  2. not over-running the buffer -- i.e. don't call it when on the NUL past the end of the string, only ones before the end of the string (which it knows)

It is a bit weird and subtle, but I think it makes sense

(and this issue is why I was initially confused about the whole state machine / "inverting" the Crockford code)

PossiblyAShrub commented 1 month ago

Yeah, you were right about removing the END_OF_STREAM error state; it simplified the code while preserving correctness. The "nul-terminator required but you must keep track of the buffer end" rule is certainly subtle, so I made sure to note it in the doc-comment.

This is ready for another review.

andychu commented 1 month ago

Looks very nice now, thanks!

andychu commented 1 month ago

(thought)

One way to think about this is that utf8_decode() does NOT take a NUL terminated string

It is more like an "unsafe transducer" that takes a pointer, sorta like J8EncodeOne() and ShellEncodeOne()

It does "one" thing which happens to involve a variable number of bytes processed