kdl-org / kdl

the kdl document language specifications
https://kdl.dev

v2.0 additional restricted literal characters #250

Closed tabatkins closed 9 months ago

tabatkins commented 2 years ago

Currently, idents disallow a few characters from being expressed literally, requiring they be escaped if authors want to include them:

I think there are a few more we can reasonably restrict to make KDL documents more readable/understandable:

Removing 0x7F just seems like fixing an omission; it's easy to forget that the ASCII control characters aren't contiguous.
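The gap is easy to confirm by enumerating the ASCII range. This is only a quick Python sketch, not anything from the KDL spec; it assumes "control character" means Unicode general category `Cc`, and the name `ascii_controls` is just illustrative.

```python
import unicodedata

# Collect every ASCII code point whose Unicode general category is Cc (control).
ascii_controls = [cp for cp in range(0x80) if unicodedata.category(chr(cp)) == "Cc"]

# Everything up to 0x1F is one contiguous run; anything beyond it is an outlier.
outliers = [hex(cp) for cp in ascii_controls if cp > 0x1F]
print(outliers)  # ['0x7f']
```

So a rule written as "0x20 or below" quietly misses DEL, which sits alone at the top of the ASCII range.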

Removing the direction-control characters helps keep KDL source readable; the direction override characters in particular are fraught when they show up in plain-text documents, as they can cause the following text to be displayed in the wrong direction (as demonstrated in the recent somewhat-hyperbolic complaints about them showing up in Rust and other source languages as a possible review-attack). If these characters are desired for use in text values, such as strings, they can still be escaped; their literal usage in what is otherwise an ASCII-based language is virtually always either accidental or malicious, since they're intended for text formatting and have no semantic meaning.
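As a rough illustration of what "detect the literal character, escape it instead" could look like, here is a small Python sketch. The set below is an assumption: it is the Unicode `Bidi_Control` property set, which may not match the exact list proposed for KDL. The helper names are hypothetical, and the `\u{...}` form is KDL's string escape syntax.

```python
# Assumed set: code points with the Unicode Bidi_Control property.
# The list actually proposed for KDL may be narrower or wider than this.
BIDI_CONTROLS = frozenset(
    [0x061C, 0x200E, 0x200F]        # ALM, LRM, RLM (implicit marks)
    + list(range(0x202A, 0x202F))   # LRE, RLE, PDF, LRO, RLO
    + list(range(0x2066, 0x206A))   # LRI, RLI, FSI, PDI
)


def find_bidi_controls(text: str) -> list[tuple[int, str]]:
    """Return (index, 'U+XXXX') for each literal bidi control in text."""
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) in BIDI_CONTROLS
    ]


def escape_bidi_controls(text: str) -> str:
    """Replace each literal bidi control with a \\u{XXXX} escape."""
    return "".join(
        f"\\u{{{ord(ch):04X}}}" if ord(ch) in BIDI_CONTROLS else ch
        for ch in text
    )


sample = "ab\u202Ecd"  # contains a literal RIGHT-TO-LEFT OVERRIDE
print(find_bidi_controls(sample))    # [(2, 'U+202E')]
print(escape_bidi_controls(sample))  # ab\u{202E}cd
```

The escaped form carries the same code point but is visible in review, which is the point of restricting only the literal form.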

The BOM is allowed at the start of a KDL document

(A previous issue suggested restricting the surrogate-pair characters as well (0xD800-DFFF); these are already restricted implicitly by the requirement that KDL documents be encoded in UTF-8, where such codepoints can't be validly encoded. As such I'm continuing to omit them from these suggestions.)
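The implicit restriction can be checked with any strict UTF-8 codec. A minimal Python sketch (the helper name `utf8_encodable` is made up for illustration):

```python
def utf8_encodable(cp: int) -> bool:
    """True if the code point can be encoded by a strict UTF-8 encoder."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        # Python's strict UTF-8 codec rejects lone surrogates.
        return False


# Code points just outside the surrogate block encode fine; those inside do not.
print([utf8_encodable(cp) for cp in (0xD7FF, 0xD800, 0xDFFF, 0xE000)])
# [True, False, False, True]
```

The same holds in the other direction: the byte sequence that would naively represent U+D800 (`ED A0 80`) is rejected by a strict UTF-8 decoder.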

While there are still a number of "invisible" characters in Unicode that could potentially be confusing or accidental, they also have semantic uses, so I don't currently recommend restricting them.

marrus-sh commented 2 years ago

I think you should not remove direction characters as they make it impossible to literally encode bidirectional strings, which is important for internationalization. It is true that BIDI control characters can create review‐attacks, but KDL is not a programming language and the probability of someone needing it to encode lengthy strings (which may include bidirectional text) is pretty high. Linters and formatters can be used by individual projects to detect and warn about the use of these characters if needed; there is no reason to forbid them at a language level.

I would suggest disallowing exactly the same characters as RestrictedChar in XML 1.1, plus U+0000, U+FFFE, and U+FFFF (which are not allowed to be escaped in XML either).
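For reference, that suggestion could be sketched roughly as follows in Python. The ranges are the `RestrictedChar` production from the XML 1.1 grammar plus the three extra code points named above; the names `RESTRICTED_RANGES` and `is_restricted` are hypothetical, and this is not something KDL itself defines.

```python
# Inclusive code point ranges. The middle four come from XML 1.1's
# RestrictedChar production; the first and last are the extras suggested above.
RESTRICTED_RANGES = [
    (0x0000, 0x0000),  # extra: NUL
    (0x0001, 0x0008),
    (0x000B, 0x000C),
    (0x000E, 0x001F),
    (0x007F, 0x0084),
    (0x0086, 0x009F),
    (0xFFFE, 0xFFFF),  # extra: the two BMP noncharacters
]


def is_restricted(ch: str) -> bool:
    """True if the single character ch falls in any restricted range."""
    cp = ord(ch)
    return any(lo <= cp <= hi for lo, hi in RESTRICTED_RANGES)


# Tab, LF, CR and NEL (U+0085) stay allowed; DEL and other C0/C1 controls do not.
print([is_restricted(c) for c in "\t\n\r\x85\x7f\x00"])
# [False, False, False, False, True, True]
```

Note that under this rule the direction-control characters (for example U+202E) remain allowed as literals, which is the intent of the suggestion.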

Lucretiel commented 2 years ago

> I think you should not remove direction characters as they make it impossible to literally encode bidirectional strings, which is important for internationalization.

To be clear, the proposal is only to remove them from identifiers like node. There's nothing stopping someone from using them (in either literal or escaped form) in a quoted string.

tabatkins commented 2 years ago

Well, my post was unclear; I talked about the ident restrictions at first, but then later mentioned being able to include them in strings via escapes.

But yeah, I think just talking about idents is fine. (Notably, you can't escape anything in raw strings, which would be somewhat limiting.)

zkat commented 9 months ago

These changes have been merged into the kdl-v2 branch.