kdl-org / kdl

the kdl document language specifications
https://kdl.dev

The spec says "code point" where it likely means "unicode scalar value" #207

Closed CAD97 closed 9 months ago

CAD97 commented 2 years ago

From the Unicode Glossary:

Code Point
(1) Any value in the Unicode codespace; that is, the range of integers from 0 to 10FFFF₁₆. (See definition D10 in Section 3.4, Characters and Encoding.) Not all code points are assigned to encoded characters. See code point type. (2) A value, or position, for a character, in any coded character set.
Unicode Scalar Value
Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF₁₆ and E000₁₆ to 10FFFF₁₆ inclusive. (See definition D76 in Section 3.9, Unicode Encoding Forms.)
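
To make the difference between the two glossary terms concrete, here is a minimal Python sketch of both definitions as range checks. The function and constant names are hypothetical, chosen only for illustration; the numeric ranges are the ones quoted above.

```python
# Illustrative names only; the ranges come from the glossary entries above.
MAX_CODE_POINT = 0x10FFFF   # top of the Unicode codespace
SURROGATE_FIRST = 0xD800    # first high-surrogate code point
SURROGATE_LAST = 0xDFFF     # last low-surrogate code point


def is_code_point(value: int) -> bool:
    """True for any integer in the Unicode codespace, surrogates included."""
    return 0 <= value <= MAX_CODE_POINT


def is_scalar_value(value: int) -> bool:
    """True for any code point that is not a surrogate."""
    return is_code_point(value) and not (SURROGATE_FIRST <= value <= SURROGATE_LAST)
```

Under these definitions, 0xD800 is a code point but not a scalar value, while 0xD7FF and 0xE000 are both.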

The first requirement of the spec is that "KDL documents should be UTF-8". Similarly, strings "MUST be represented as UTF-8 values." Note that a well-formed UTF-8 stream MUST NOT contain surrogate code points; it encodes a sequence of USV.
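
As a rough check of that last claim, the sketch below uses Python's built-in strict UTF-8 codec, which rejects surrogates in both directions. The helper names are hypothetical, and `utf8_encodable` assumes its argument is already within 0 to 0x10FFFF (otherwise `chr` raises `ValueError`).

```python
def utf8_encodable(code_point: int) -> bool:
    """True if Python's strict UTF-8 encoder accepts this code point.

    Assumes 0 <= code_point <= 0x10FFFF.
    """
    try:
        chr(code_point).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


def utf8_decodable(data: bytes) -> bool:
    """True if the bytes are accepted by Python's strict UTF-8 decoder."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

For example, `utf8_encodable(0xD800)` is false, and the byte sequence `ED A0 80` (what a naive encoder would emit for U+D800) is not decodable, whereas `EE 80 80` (U+E000) is.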

However, the formal grammar just says unicode (presumably, either [\0-\u{10FFFF}] or [\0-\u{D7FF}\u{E000}-\u{10FFFF}] — unclear (#191, #192)), and the prose spec allows "literal code points" and escaped "Code point described by hex characters". This allows the presence of surrogates, at a minimum as escaped code points, even if their literal inclusion is precluded by the higher-level requirement that the document be well-formed UTF-8.

At least one implementation documents this as a potential spec non-compliance. Given the requirement of UTF-8, I expect that this is just a terminology oversight, and the spec should say Unicode Scalar Value (or non-surrogate code point) in the two locations where it currently just says "code point".

tabatkins commented 2 years ago

The formal grammar referring to unicode as the set of all codepoints is fine, since valid UTF-8 documents can't encode the surrogate code points in the first place.

It's not technically necessary to restrict it from the unicode escape sequence either, tho it would complicate implementations by requiring them to store the values in something other than a UTF-8 string, and always output them escaped. So, I do recommend disallowing those code points, either by removing them from the escape's valid values, or allowing them but having those codepoints transform into U+FFFD REPLACEMENT CHARACTER.

The latter is, for example, what CSS does, but CSS has a requirement that all possible byte streams decode into something, while KDL is happy to reject entire documents that have a parse error. I don't have a strong opinion either way; it's generally an error to write such escapes in the first place.
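
The two options can be sketched side by side. This is only an illustration in Python, not code from any KDL implementation: `resolve_escape`, `EscapeError`, and the `replace_invalid` flag are hypothetical names, the function assumes the caller has already checked that `hex_digits` contains only hexadecimal digits, and treating out-of-range values the same way as surrogates is an assumption made here.

```python
REPLACEMENT_CHARACTER = "\uFFFD"


class EscapeError(ValueError):
    """Hypothetical error type for an escape that names no scalar value."""


def resolve_escape(hex_digits: str, *, replace_invalid: bool = False) -> str:
    """Turn the hex digits of a unicode escape into a one-character string.

    replace_invalid=False rejects surrogates (the "parse error" option);
    replace_invalid=True maps them to U+FFFD (the CSS-like option).
    """
    value = int(hex_digits, 16)  # assumes digits were validated by the caller
    is_surrogate = 0xD800 <= value <= 0xDFFF
    if is_surrogate or value > 0x10FFFF:
        if replace_invalid:
            return REPLACEMENT_CHARACTER
        raise EscapeError(f"U+{value:04X} is not a Unicode scalar value")
    return chr(value)
```

With the strict default, `resolve_escape("D800")` raises, so the whole document can be rejected; with `replace_invalid=True` it returns U+FFFD. Either way the result is always storable in a UTF-8 string.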

zkat commented 9 months ago

This change is now included in the kdl-v2 branch.