Fix and simplify identifier character grammar

kdl-org / kdl

the kdl document language specifications

https://kdl.dev

Fix and simplify identifier character grammar #192

Closed larsgw closed 2 years ago

larsgw commented 2 years ago

- remove some of the "minus" syntax: it is only used in this part of the grammar AFAIK, and from my own experience, it can be confusing
- add \x00-\x20 to the list of non-identifier chars

Fix #191

larsgw commented 2 years ago

Important: this also removes 0x7F, but I can put that back

CAD97 commented 2 years ago

Related note: due to the requirement for the document to be UTF-8, unicode potentially is meant to refer to Unicode Scalar Values; that is, [\0-\u{D7FF}\u{E000}-\u{10FFFF}], not [\0-\u{10FFFF}] (see #207).
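
The two ranges in the note above can be written as a small predicate. This is only an illustrative sketch: the function name `is_unicode_scalar_value` is hypothetical and does not come from the KDL spec or any KDL implementation.

```python
def is_unicode_scalar_value(cp: int) -> bool:
    """Return True if cp is a Unicode scalar value.

    Scalar values are all code points except the surrogate range
    U+D800..U+DFFF, i.e. [0, 0xD7FF] and [0xE000, 0x10FFFF].
    """
    return 0 <= cp <= 0xD7FF or 0xE000 <= cp <= 0x10FFFF


# All code points vs. scalar values: the difference is the 2048 surrogates.
ALL_CODE_POINTS = 0x110000
SCALAR_VALUES = sum(1 for cp in range(ALL_CODE_POINTS) if is_unicode_scalar_value(cp))
```

Here `ALL_CODE_POINTS - SCALAR_VALUES` equals 2048, the size of the surrogate block.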

zkat commented 2 years ago

Moving this to target the v2 branch since it includes breaking changes.

tabatkins commented 2 years ago

> Related note: due to the requirement for the document to be UTF-8, unicode potentially is meant to refer to Unicode Scalar Values; that is, [\0-\u{D7FF}\u{E000}-\u{10FFFF}], not [\0-\u{10FFFF}] (see #207).

Since the requirement for UTF-8 already exists, this distinction is moot; you can't validly encode the surrogate codepoints into UTF-8 anyway.
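
The point that surrogates have no valid UTF-8 form can be checked with a strict UTF-8 codec. The sketch below uses Python's built-in codec as a stand-in for "a conforming UTF-8 implementation"; the helper names are made up for illustration.

```python
def utf8_encodable(cp: int) -> bool:
    """True if the code point can be encoded by Python's strict UTF-8 codec."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Raised for lone surrogates (U+D800..U+DFFF).
        return False


def utf8_decodable(data: bytes) -> bool:
    """True if the bytes are accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# ED A0 80 is what a naive three-byte encoding of U+D800 would produce;
# a strict decoder rejects it.
SURROGATE_BYTES = b"\xed\xa0\x80"
```

With these helpers, `utf8_encodable(0xD800)` and `utf8_decodable(SURROGATE_BYTES)` are both `False`, while `utf8_encodable(0xE000)` is `True`. A parser that only ever sees validated UTF-8 input therefore never encounters a surrogate.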

tabatkins commented 2 years ago

This change (almost certainly unintentionally) would allow non-ASCII linespace as valid ident chars, which I think would be a bad idea. (Manually resolving a subtraction away can be tricky!)
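
A rough sketch of how a hand-expanded rule can drift from the subtraction form. Everything here is an assumption for illustration: the character sets are abbreviated and hypothetical (the real KDL whitespace and newline tables are longer), and `is_ident_char_expanded` is a guess at the shape of an expanded rule, not the actual text of this PR.

```python
# Abbreviated, hypothetical stand-ins for the spec's tables.
LINESPACE = {"\t", " ", "\n", "\r", "\u00a0", "\u2028", "\u3000", "\ufeff"}
PUNCTUATION = set('\\/(){}<>;[]=,"')


def is_ident_char_subtraction(ch: str) -> bool:
    """'unicode - linespace - punctuation': start from everything, subtract."""
    return ch not in LINESPACE and ch not in PUNCTUATION


def is_ident_char_expanded(ch: str) -> bool:
    """Hand-expanded deny list: \\x00-\\x20 plus punctuation only."""
    return not (ord(ch) <= 0x20 or ch in PUNCTUATION)


def disagreements(chars: str) -> list:
    """Characters on which the two predicates give different answers."""
    return [c for c in chars
            if is_ident_char_subtraction(c) != is_ident_char_expanded(c)]
```

Under these assumptions, `disagreements("a \u00a0\u2028\x01")` returns `["\u00a0", "\u2028", "\x01"]`: the two non-ASCII whitespace characters slip through the expanded rule, and `\x01` is newly rejected by it. A comparison like this over the full code point range is one way to confirm that a rewritten grammar rule accepts the same set as the original.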

larsgw commented 2 years ago

I'll try to fix it when I can.

tabatkins commented 2 years ago

I'd prefer this not be merged. It currently changes the grammar in two ways, but even when those are fixed, I personally found the minus syntax perfectly readable and easy to translate to code. This PR doesn't remove all the minuses, anyway - we're left with - keyword.

More importantly, tho, if we fix #200 with #241 in v2, then this won't be complete anyway, and will need further non-trivial revision. The #241 fix, on the other hand, uses minus more heavily, and imo remains very readable.