Closed · CAD97 closed 9 months ago
The formal grammar referring to `unicode` as the set of all code points is fine, since valid UTF-8 documents can't encode the surrogate code points in the first place.
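That constraint can be checked directly in Rust, where `char` is by definition a Unicode scalar value (a sketch, nothing KDL-specific):

```rust
fn main() {
    // `char::from_u32` refuses the surrogate range outright, because a
    // Rust `char` is a Unicode scalar value, never a surrogate.
    assert!(char::from_u32(0xD800).is_none()); // high surrogate
    assert!(char::from_u32(0xDFFF).is_none()); // low surrogate
    // The code points on either side of the gap are ordinary characters.
    assert_eq!(char::from_u32(0xD7FF), Some('\u{D7FF}'));
    assert_eq!(char::from_u32(0xE000), Some('\u{E000}'));
    println!("surrogates cannot appear in well-formed UTF-8 text");
}
```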
It's not technically necessary to exclude surrogates from the unicode escape sequence either, though allowing them would complicate implementations by requiring them to store the values in something other than a UTF-8 string, and to always output them escaped. So I do recommend disallowing those code points, either by removing them from the escape's valid values, or by allowing them but transforming those code points into U+FFFD REPLACEMENT CHARACTER.
The latter is, for example, what CSS does, but CSS has a requirement that every possible byte stream decode into something, whereas KDL is happy to reject an entire document that has a parse error. I don't have a strong opinion either way; it's generally an error to write such escapes in the first place.
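The two options can be sketched as follows; the function name and shape are illustrative, not taken from any real implementation:

```rust
// Hypothetical handling of a KDL `\u{...}` escape payload. `hex` is the
// text between the braces; `lenient` selects the CSS-style behavior.
fn resolve_unicode_escape(hex: &str, lenient: bool) -> Option<char> {
    let value = u32::from_str_radix(hex, 16).ok()?;
    match char::from_u32(value) {
        // A valid Unicode scalar value passes through unchanged.
        Some(c) => Some(c),
        // Surrogates (the only in-range values `from_u32` rejects)
        // either become U+FFFD (lenient) or are a hard parse error.
        None if lenient && value <= 0x10FFFF => Some('\u{FFFD}'),
        None => None,
    }
}

fn main() {
    assert_eq!(resolve_unicode_escape("41", false), Some('A'));
    assert_eq!(resolve_unicode_escape("D800", false), None); // strict: error
    assert_eq!(resolve_unicode_escape("D800", true), Some('\u{FFFD}'));
    assert_eq!(resolve_unicode_escape("110000", true), None); // out of range
    println!("escape handling sketch ok");
}
```

Either way, the resolved result is always a scalar value, so implementations can keep using plain UTF-8 strings internally.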
This change is now included in the kdl-v2 branch.
From the Unicode Glossary:

> **Unicode Scalar Value.** Any Unicode code point except high-surrogate and low-surrogate code points. In other words, the ranges of integers 0 to D7FF<sub>16</sub> and E000<sub>16</sub> to 10FFFF<sub>16</sub>, inclusive.

The first requirement of the spec is that "KDL documents should be UTF-8". Similarly, strings "MUST be represented as UTF-8 values." Note that a well-formed UTF-8 stream MUST NOT contain surrogate code points; it encodes a sequence of Unicode scalar values (USVs).
However, the formal grammar just says `unicode` (presumably either `[\0-\u{10FFFF}]` or `[\0-\u{D7FF}\u{E000}-\u{10FFFF}]`; which one is intended is unclear, see #191 and #192), and the prose spec allows "literal code points" and escaped "Code point described by hex characters". This allows the presence of surrogates, at a minimum as escaped code points, even if their literal inclusion is precluded by the higher-level requirement that the document be well-formed UTF-8.

At least one implementation documents this as a potential spec non-compliance. Given the UTF-8 requirement, I expect this is just a terminology oversight, and the spec should say "Unicode scalar value" (or "non-surrogate code point") in the two locations where it currently says just "code point".
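For what it's worth, the narrower of the two candidate classes is exactly the set of scalar values; an exhaustive check (a sketch, assuming nothing beyond the two character classes quoted above):

```rust
// The narrower production [\0-\u{D7FF}\u{E000}-\u{10FFFF}] admits exactly
// the Unicode scalar values; the wider [\0-\u{10FFFF}] additionally
// admits the 2048 surrogate code points.
fn in_narrow_class(cp: u32) -> bool {
    matches!(cp, 0x0..=0xD7FF | 0xE000..=0x10FFFF)
}

fn main() {
    let mut surrogates = 0u32;
    for cp in 0x0..=0x10FFFF {
        let is_scalar = char::from_u32(cp).is_some();
        // The narrow class agrees with Rust's `char` on every code point.
        assert_eq!(in_narrow_class(cp), is_scalar);
        if !is_scalar {
            surrogates += 1;
        }
    }
    assert_eq!(surrogates, 2048); // U+D800..=U+DFFF
    println!("narrow class == scalar values; wide class adds 2048 surrogates");
}
```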