kdl-org / kdl

the kdl document language specifications
https://kdl.dev

Definition of a character, UTF-8 weirdness #265

Closed by oldaccountdeadname 2 years ago

oldaccountdeadname commented 2 years ago

The spec says that KDL documents are all UTF-8 encoded, and terminals are defined in terms of characters. I may be missing it, but is "character" defined anywhere? Is it a codepoint? A grapheme cluster? Concretely, could a valid identifier be, say, a control character?

I'm not sure how precise I should make my implementation: iterating by grapheme cluster is significantly more complex than iterating by codepoint, but if one of the two is what the spec intends, then that's what should be done.
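
To make the difference concrete, here is a minimal Rust sketch of the codepoint option. The helper names `codepoints` and `byte_and_codepoint_len` are made up for illustration and are not taken from the spec or from any KDL implementation. It only shows that byte length, codepoint count, and what a reader perceives as one character can all differ for the same UTF-8 input.

```rust
/// Hypothetical helper: collects the Unicode scalar values ("codepoints")
/// of `input`, in order. `str::chars` decodes the UTF-8 bytes for us.
fn codepoints(input: &str) -> Vec<char> {
    input.chars().collect()
}

/// Hypothetical helper: returns (length in UTF-8 bytes, number of codepoints).
fn byte_and_codepoint_len(input: &str) -> (usize, usize) {
    (input.len(), input.chars().count())
}
```

For example, `"e\u{301}"` (an `e` followed by a combining acute accent) displays as a single accented letter, yet `codepoints` yields two values for it and it occupies three bytes. Splitting it into grapheme clusters instead would need Unicode segmentation rules, which Rust's standard library does not provide.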

Thanks so much for looking over this and also writing out this lang; it's very useful!

oldaccountdeadname commented 2 years ago

Briefly looking at kdl-rs, I think any non-surrogate codepoint is interpreted as a character, given how Rust deals with chars, and how little Nom seems to care. Is this the recommended/mandated approach?
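
As a small sketch of what "how Rust deals with chars" means here: a Rust `char` is a Unicode scalar value, so surrogates and values above U+10FFFF cannot be represented at all. The helper name `is_scalar_value` is hypothetical, and this says nothing about which codepoints KDL itself allows in an identifier; it only shows what the `char` type accepts.

```rust
/// Hypothetical helper: true if `value` can be represented as a Rust `char`,
/// i.e. it is a Unicode scalar value (any codepoint except the surrogate
/// range U+D800..=U+DFFF, and no larger than U+10FFFF).
fn is_scalar_value(value: u32) -> bool {
    char::from_u32(value).is_some()
}
```

Note that control characters such as U+0007 pass this check, so the type alone does not rule them out.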

zkat commented 2 years ago

I say just do it by codepoint? :)

oldaccountdeadname commented 2 years ago

Okay, thanks!