kdl-org / kdl

the kdl document language specifications
https://kdl.dev

Newline normalization #360

Closed marrus-sh closed 5 months ago

marrus-sh commented 7 months ago

As far as I can tell, the following is true according to the current spec.

This seems to me like it would very rarely be useful. I think KDL processors should be expected to normalize some newlines prior to processing. U+000C (FF) and U+2029 (PSEP), which have additional semantics beyond being just a newline, should not be normalized.

I think the rules from XML 1.1 are probably a good place to start:

To simplify the tasks of applications, the XML processor must behave as if it normalized all line breaks in external parsed entities (including the document entity) on input, before parsing, by translating all of the following to a single #xA character:

  • the two-character sequence #xD #xA

  • the two-character sequence #xD #x85

  • the single character #x85

  • the single character #x2028

  • any #xD character that is not immediately followed by #xA or #x85.

This normalizes CRLF, CRNEL, NEL, LSEP, and CR (not followed by LF or NEL) to LF, without touching FF or PSEP. These characters can still be included in multiline strings via character escapes, if they should be necessary.
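For illustration, the quoted XML 1.1 rules can be sketched as a single substitution pass. This is only a minimal sketch, not anything taken from the KDL spec or an existing KDL implementation; the function name `normalize_newlines` is hypothetical.

```python
import re

# Line-break sequences from the XML 1.1 rules quoted above.
# The two-character sequences come first so that CR LF and CR NEL
# collapse to one LF instead of two.
_XML11_LINE_BREAK = re.compile("\r\n|\r\x85|\x85|\u2028|\r")


def normalize_newlines(text: str) -> str:
    """Translate CRLF, CR NEL, NEL, LSEP, and lone CR to a single LF.

    Hypothetical helper: U+000C (FF) and U+2029 (PSEP) are deliberately
    left untouched, as proposed in this comment.
    """
    return _XML11_LINE_BREAK.sub("\n", text)
```

Under this sketch, `normalize_newlines("a\r\nb\rc")` yields `"a\nb\nc"`, while a string containing FF or PSEP comes back unchanged.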

zkat commented 7 months ago

I don't know about FF/PSEP, but I really don't think we should normalize newlines in string values. They may very well be meaningful (consider, perhaps, HTTP headers, when you author something on Linux). The point of the normalization rules is to allow KDL documents themselves to have the KDL syntax structure itself work across platforms easily.

tabatkins commented 7 months ago

Agreed, newlines shouldn't be normalized in strings. There are a lot of potential platform differences when editing files, far more than just newlines, and trying to handle that automatically at the data layer is extremely fraught.

marrus-sh commented 7 months ago

This stance is confusing to me. Is the expectation that every KDL query which is interested in a string which contains newlines must manually check for every possible newline option and then normalize the result? Or that KDL queries cannot be expected to be cross‐platform?

XML, HTML, CSS, and JavaScript all normalize newlines, so that script authors don’t have to manually do newline normalization every time they encounter a string.

HTML (§13.2.3.5 Preprocessing the input stream):

Before the tokenization stage, the input stream must be preprocessed by normalizing newlines. Thus, newlines in HTML DOMs are represented by U+000A LF characters, and there are never any U+000D CR characters in the input to the tokenization stage.

CSS (§3.3. Preprocessing the input stream):

The input stream consists of the filtered code points pushed into it as the input byte stream is decoded.

To filter code points from a stream of (unfiltered) code points input:

  • Replace any U+000D CARRIAGE RETURN (CR) code points, U+000C FORM FEED (FF) code points, or pairs of U+000D CARRIAGE RETURN (CR) followed by U+000A LINE FEED (LF) in input by a single U+000A LINE FEED (LF) code point.

  • Replace any U+0000 NULL or surrogate code points in input with U+FFFD REPLACEMENT CHARACTER (�).
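As a rough comparison, the two CSS filtering steps quoted above might look like the following. This is a sketch under the assumption that the input has already been decoded to a Python `str`; the name `css_filter_code_points` is made up for this example and is not part of any CSS library. Note that, unlike the XML 1.1 rules, the CSS step also folds FF into LF.

```python
import re

# Step 1: CR LF pairs, lone CR, and FF all become a single LF.
# The CR LF alternative is listed first so the pair yields one LF.
_CSS_NEWLINE = re.compile("\r\n|\r|\x0c")

# Step 2: NULL and surrogate code points become U+FFFD.
_CSS_INVALID = re.compile(r"[\x00\ud800-\udfff]")


def css_filter_code_points(text: str) -> str:
    """Hypothetical helper applying the two quoted CSS filtering steps."""
    text = _CSS_NEWLINE.sub("\n", text)
    return _CSS_INVALID.sub("\ufffd", text)
```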

JavaScript (§12.9.6.2 Static Semantics: TRV):

NOTE TV excludes the code units of LineContinuation while TRV includes them. <CR><LF> and <CR> LineTerminatorSequences are normalized to <LF> for both TV and TRV. An explicit TemplateEscapeSequence is needed to include a <CR> or <CR><LF> sequence.

Of course the exact bytes used for a newline sometimes matter. But in these cases the newline can simply be encoded using character escapes. It seems wrong to prioritize this specialized, technical case over the common one, and in violation of current common practice in both document languages and programming ones.

tabatkins commented 7 months ago

Hm, CSS isn't relevant here, since you can't include literal newlines in strings. JS also excludes literal newlines in normal strings, but does indeed normalize literal newlines in template strings. XML/HTML fully normalizes, tho it also mixes content and structure in a way that can't be untangled, so it's harder to use it as justification. YAML normalizes newlines inside its multiline strings. JSON can't contain literal newlines in its strings.

It looks like when you can have multiline strings, the convention is to normalize the newlines, @zkat. I suspect we should follow that convention, then.

zkat commented 7 months ago

@tabatkins interesting. Do they all normalize to \n, then, regardless of engine/platform?

tabatkins commented 6 months ago

Yup, the data model presented to code is always \n, even if their serializers might output platform-specific newlines.
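A tiny sketch of that split between the data model and the serializer, assuming a hypothetical pair of helpers (`load_string_value` and `serialize_string_value` are invented for this example and handle only CR LF and lone CR for brevity):

```python
import os


def load_string_value(raw: str) -> str:
    """Hypothetical loader: the in-memory value only ever contains LF."""
    # Replace CR LF first so the pair does not turn into two LFs.
    return raw.replace("\r\n", "\n").replace("\r", "\n")


def serialize_string_value(value: str, newline: str = os.linesep) -> str:
    """Hypothetical serializer: writes LF out as the requested newline.

    Defaults to the platform's line separator, which is the only place
    where platform differences show up in this sketch.
    """
    return value.replace("\n", newline)
```

Code that inspects the loaded value can then test for `"\n"` alone, regardless of how the file was authored or how it will be written back out.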

zkat commented 5 months ago

This is done.