kdl-org / kdl

the kdl document language specifications
https://kdl.dev
Other
1.09k stars 61 forks source link

Merge KDL v2 #286

Open zkat opened 1 year ago

zkat commented 1 year ago

Here it is! The long-awaited KDL v2, which is where we go ahead and make a handful of technically-breaking changes to address some corner cases we've run into over the past year while KDL has been getting implemented in a bunch of languages by various people.

I'd love to get feedback on what we have slated, and whether there's anything else we should definitely include when this goes out.

zkat commented 1 year ago

/cc @CAD97

CAD97 commented 1 year ago

I have a slight preference for #241 over #204 personally, though only slight.

hkolbeck commented 1 year ago

I have a preference for #204, because the primary use case I can see for # in bare identifiers is hashtag-like which would be illegal under either, and it seems better to go with the simpler rule.

That preference is not terribly strong, though.

Edit: I misread, I'm fine with either

CAD97 commented 1 year ago

the primary use case I can see for # in bare identifiers is hashtag-like

To clarify, #241 allows #ident as a bare ident, and both will of course still allow "r#ident" as a quoted ident.

Argument for allowing: transliterating CSS selectors, for e.g. CSS-in-KDL. Argument against allowing: using the syntax in KQL as a selector like CSS.

lilyball commented 1 year ago

Argument for allowing: transliterating CSS selectors, for e.g. CSS-in-KDL. Argument against allowing: using the syntax in KQL as a selector like CSS.

#foo in CSS is special-casing the id attribute. KQL doesn't have an equivalent to HTML's id, and using #foo syntax in KQL to mean something else might be confusing given its meaning in CSS, so I don't find the argument against compelling.

My inclination is to prefer #241 as well, as I think being able to write hashtags is neat. It also allows for doing things like writing Nix flake references as bare words, e.g. nixpkgs#hello.

Lucretiel commented 1 year ago

Can we squeeze https://github.com/kdl-org/kdl/issues/213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

hkolbeck commented 1 year ago

I'm also a fan of #213, though it seems like there's some ambiguity in the discussion. Namely, does

- "x\
    y\
    z"

Translate to "xyz", "x y z", or "x\ny\nz"?

zkat commented 1 year ago

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Lucretiel commented 1 year ago

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Yes, tonight I can put that together :) should it be in the form of an amendment to SPEC.md?

Lucretiel commented 1 year ago

I'm also a fan of #213, though it seems like there's some ambiguity in the discussion. Namely, does

- "x\
    y\
    z"

Translate to "xyz", "x y z", or "x\ny\nz"?

I agree there's some ambiguity in the original. That example would translate to "xyz", because all literal whitespace after the \ is consumed and discarded. If you want to retain whitespace, it should either come before the \ or itself be escaped. I think my comment (https://github.com/kdl-org/kdl/issues/213#issuecomment-929869117) succinctly describes this.

zkat commented 1 year ago

Can we squeeze #213 into this? The specific proposal is the addition of escaped whitespace in string literals– that \, followed by literal (non-escaped) whitespace, should consume and discard all that whitespace. This is a slight simplification of the Rust rule, which specifically requires that \ be followed by \n.

@Lucretiel do you have time to put together a PR with this grammar+prose change? I'm game.

Yes, tonight I can put that together :) should it be in the form of an amendment to SPEC.md?

yep!

I agree there's some ambiguity in the original. That example would translate to "xyz", because all literal whitespace after the \ is consumed and discarded. If you want to retain whitespace, it should either come before the \ or itself be escaped. I think my comment (https://github.com/kdl-org/kdl/issues/213#issuecomment-929869117) succinctly describes this.

Is this what Rust does? I would've expected that to at least preserve the first newline. Then again, this is consistent with KDL's existing escline rule where \<newline> is the same as <non-newline whitespace>

CAD97 commented 1 year ago

Is this what Rust does?

[playground]

[src/main.rs:2] dbg!("\
    here\
    is\
    an\
    example\
    ") = "hereisanexample"
hkolbeck commented 1 year ago

It's worth noting that bash behaves similarly as far as just dropping the newline, though it doesn't consume space afterward:

❯ echo foo\
… ❯ bar\
… ❯ baz
foobarbaz

With that I think xyz is the right output, and am +1 on including it in v2

Edit: Scratch that, I'm a space cadet:

❯ echo foo\
      bar\
      baz
foo bar baz

I'm more prone to emulating bash over rust, but I'm curious how others feel

Lucretiel commented 1 year ago

Bash's behavior is concerned with syntactic whitespace (ie, allowing commands to spread over multiple lines with line continuations). It doesn't meaningfully behave in terms of consuming or not consuming specific whitespace so much as it extends a line to the next line while retaining the separation of tokens for a command. In your echo example, all that's happened is that the foo and bar and baz have correctly been passed as different arguments to echo; it's no different than:

> echo foo             bar \
   baz
foo bar baz

Kaydle has basically the same behavior with its own line continuation syntax, where you can use a \ to continue a single node into the next line. All these nodes are the same:

node 1 2 3
node 1    2   3
node 1\
  2\
  3

213 is instead concerned with treatment of escaped whitespace in strings, where I think the plain consumption of unescaped whitespace makes the most sense

Is this what Rust does? I would've expected that to at least preserve the first newline. Then again, this is consistent with KDL's existing escline rule where \ is the same as

Rust does just consume all whitespace, regardless of type. The canonical way to add newlines to a whitespace-escaped string to to escape them:

assert_eq!(
    "line 1\n\
    line 2\n\
    line 3\n",

"line 1
line 2
line 3
"
);

Though more commonly I use it to stretch out long sentences with simple spaces:

assert_eq!(
    "This is a sentence with a \
    lot of words in it.",
    "This is a sentence with a lot of words in it."
);
hkolbeck commented 1 year ago

That makes sense, and the distinction is certainly important. Thanks for the complete writeup.

hkolbeck commented 1 year ago

Adding escaped whitespace note to the changelog: https://github.com/kdl-org/kdl/pull/291

zkat commented 1 year ago

Nudging the thread because I've added https://github.com/kdl-org/kdl/issues/250 to the bucket of things we should probably discuss for 2.0

hkolbeck commented 1 year ago

Is https://github.com/kdl-org/kdl/discussions/177 worth including in discussions here?

zkat commented 1 year ago

yeah, probably. Although I'm inclined towards having foo and ("")foo be distinct values. I'm kinda iffy on the special case here.

hkolbeck commented 1 year ago

I'm very split, on the one hand it is a special case and I'm very averse, but as @Patitotective noted this it would force implementations to distinguish between the two, which would lead to a more complex API in some languages (JS is top of mind). In addition, I just don't see a use case for blank type annotations given that impls are free to define their own.

I'm not really trying to go either direction here, just laying out thoughts.

lilyball commented 1 year ago

Why would it be hard in JS? Can't JS just use null versus ""?

I haven't really been following this and I have no experience with type annotations in KDL but my initial reaction here is that specifying "" is potentially semantically distinct from not specifying a type annotation.

I'm curious what languages are actually expected to have a problem here? Languages without proper optional support tend to have some concept of null. Even in Go I'd expect that you could use a pointer-to-string in order to have nil.

That said, I've been using languages with proper optional support for long enough that I'm not sure how much of an ergonomic problem it would be to require folks to handle null type annotations in languages like Go or JS.

hkolbeck commented 1 year ago

JS gives you null, but empty strings are falsey. It's for sure a small thing, it just means instead of writing:

if (val.annotation) {
   ...
}

You have to write:

if (val.annotation === null) {
    ...
}

I would count that as a more complex API.

Patitotective commented 1 year ago

For most statically typed languages (like Nim) you have to use Option types which distinguish between no value and empty value. This would make APIs more complex and seems pointless to me, ("")node should be the same as node.

bgotink commented 1 year ago

I'm pro making them distinct values. Packages that work with the CST to provide an API for modifying KDL text while retaining comments and formatting would be more complex and likely inconsistent without the distinction.

If an empty or an absent annotation is considered the same, these packages would need to track that. Even if they then map empty and absent annotations onto the same public value, it would lead to confusing behaviour in the locations the different CST nodes:

node /* comment */ ()null
//                |         < end of the leading whitespace for the `null` value
//                   |      < start of the `null` value
//                 __       < missing locations

Making the empty annotation () part of the leading whitespace would be wrong cf. the language specification.

Imo the best solution for these packages is to expose the difference between an empty and an absent annotation: consistent CST node locations and no () as part of whitespace.

hkolbeck commented 1 year ago

One minor point to @bgotink's example, which I think is a good point, is the ()val does not agree with the spec by my reading, it has to be ("")val

CAD97 commented 1 year ago

I agree with @larsgw here in that if this is considered a problem to solve, the better solution would be to forbid zero-length identifiers rather than do some magic to make ("") equivalent to no type annotation. Given that type annotations are not given any meaning[^1] by the specification, it's fine imho for an implementation to treat a present-but-zero-length type annotation equivalently to no type annotation.

[^1]: > KDL does not specify any restrictions on what implementations might do with these annotations. They are free to ignore them, or use them to make decisions about how to interpret a value.

So my vote here is no change.

Patitotective commented 1 year ago

I agree on considering this a problem and forbidding zero-length identifiers. This change would convert valid tests like blank_node_type.kdl, blank_arg_type.kdl, blanl_prop_type.kdl, empty_quoted_prop_key.kdl and empty_quoted_node_id.kdl to invalid, it would need to be specified in the spec. If this is okay, I'll create a PR.

Patitotective commented 1 year ago

By the way, is there a reason why some tests use empty and others blank?

Lucretiel commented 1 year ago

I'm pro-distinct too. I think it's fine if a particular consumer of KDL wants them to be identical, or even if a particular implementation of type hints treat them as identical, but I think that it makes sense that a KDL data model treats them as distinct (essentially, as Option<String>). I'd be opposed to requiring implementations to treat them as identical.

One example would be a particular KDL implementation that uses annotations exclusively as strong type hints. I'd want 123 to be dynamically typed, (f64)123 to be a float, and ("")123 to be an error.

Lucretiel commented 1 year ago

I'd be opposed to banning 0-length identifiers; I think that adds more complexity / confusion than it's worth. Currently, an identifier is either "bare identifier" or "quoted string", and I'm not a fan of making it instead a special non-empty subset of "quoted string". I really like how the string is the "escape hatch" into unusual identifiers and don't really see the value in constraining it (especially since a vast majority of languages don't have an ergonomic way to express "string that's definitely not zero-length").

Patitotective commented 1 year ago

For v2.0 KDL could also allow multi-line comments not only where whitespaces can be, allowing node (type)/*comment*/true.

tabatkins commented 10 months ago

We should add in the #331 change, to add vertical tab to the whitespace table. (Edit: done.)

tabatkins commented 10 months ago

For the #204 vs #241 decision (how to handle # in idents), everyone who's commented in the thread seems to either be neutral or leans toward #241 (just ban idents that look rawstring-like). So I'm going to commit that into the branch.

tabatkins commented 10 months ago

I'm strongly inclined to accept #255 and #250; they're both banning invisible control characters from idents (and for #255, matching the grammar to what the prose already indicates).


For #270, I'm inclined to accept it. It does make the grammar nastier, unfortunately, but being able to drop comments in wherever is quite nice when debugging. I was slightly concerned this might put us in a similar unfortunate situation as CSS where 2/**/px was allowed (and requires special serialiation rules to force that to serialize as 2 px rather than 2px, but nah, KDL doesn't go with CSS's "any token sequence is allowed, whitespace can go anywhere" philosophy. Since our whitespace allowance/requirements are strictly indicated in the grammar, this should be fine. (That is, node/**/prop="foo" would still be invalid, as you need whitespace between the nodename and the property name.)


For #285, inclined to accept it. It makes the grammar a little more understandable, I think, since it separates out slash-dashed things from the core notion of what forms the document. And allowing multiple slash-dashed children blocks, or interspersed slashdashed children and arguments, also seems like a reasonable thing, yeah.


For #228, I'd want to do a minor editorial pass over the document before merging it, but otherwise it looks good to go.

zkat commented 8 months ago

Hi all. I've tagged 2.0.0-draft.1, and it's ready to review! Please give it a look and let's see what else we want to change from here. If you're wondering what changed, I suggest looking at the changelog.

Nothing is really set in stone, but I figured I would incorporate a bunch of stuff we've been sorta-kinda talking about but not really committing to, and see how it feels. I'm feeling pretty good about it!

IceDragon200 commented 8 months ago

Just some observations while updating kuddle to support the v2 spec, I may add more as I find them

Multiline Strings

Dedentation

Strings may span multiple lines with literal Newlines, in which case the resulting String is "dedented" according to the line with the fewest number of Whitespace characters preceding the first non-Whitespace character. That is, the number of Whitespace characters in the least-indented line in the String body is subtracted from the Whitespace of all other lines.

The first Newline, the last Newline, along with Whitespace following the last Newline, are not included in the value of the String. The first and last Newline can be the same character (that is, empty multi-line strings are legal).

// So these are all valid
nl_only "
"
no_space "
ABC
"
two_spaces "
  ABC
"
// Including this, which looks funky
heredoc "
a
    "
// the outdent here is scaring me, but it's valid I suppose
heredoc2 "
  do
    some code
  end
    "

que? "
    something
      something2
        something3
  "
// Since its a multiline string, but only the first newline is present, the resulting string should be empty
nl_only ""
// Multiline string with a single content  line
no_space "ABC\n"
// Multiline string with two spaces before first content line
two_spaces "  ABC\n"
// heredoc with uneven indentation, since a is the shortest line it will be used to trim the rest
// the preceding whitespace before the " is discarded
heredoc "a\n"
heredoc2 "do\n  some code\nend"

// Here is the questionable one: should it end up like this (this is what would happen in elixir)
que? "    something\n      something2\n        something3"
// Or like this (this is what would happen based on the clauses above, assuming last whitespaces are discarded)
que? "something\n  something2\n    something3"

So why is this section here?

It's the dedent rule that's the problem, those two clauses describe the newline and dedent behaviour, however it doesn't make clear if the whitespace AFTER the last newline is to be counted as the dedent, since it's discarded by the second clause.

Misleading literal newlines clause

On the topic of multiline strings, I found a wording issue:

Strings with literal Newlines that do not immediately start with a Newline and whose final " is not preceeded by whitespace and a Newline are illegal.

node "a
"
node "
"

Those are both valid strings, the latter passes the starts with a newline, but fails the latter preceded by whitespace and newline.

Those would both evaluate to the following:

node "a\n"
node2 "\n"

Which is are strings with literal newlines: The first did not immediately start with a newline, and its final " is not preceded by whitespace AND a newline (keyword here AND, the final " is preceded by a single newline, but the wording made it illegal).

I'd argue removing that line all together, because it is misleading in its current state.

zkat commented 8 months ago

@IceDragon200 Here's how you should interpret the following strings.

The last newline is also stripped:

no_space "
ABC
"
no_space "ABC"

Things are fully dedented based on their least-indented line:

que? "
    something
      something2
        something3
  "
que? "something\n  something2\n    something3"

however it doesn't make clear if the whitespace AFTER the last newline is to be counted as the dedent, since it's discarded by the second clause.

It should not be counted. It should be fully stripped, including including the newline associated with it.

Strings with literal Newlines that do not immediately start with a Newline and whose final " is not preceeded by whitespace and a Newline are illegal.

This could be worded better, I guess. Perhaps "Strings with literal newlines not immediates after the opening ", or whose final " is not preceded by (optional) whitespace and a newline are illegal".

So:

// illegal
foo "a
"

// legal
foo "
"
// evaluates to:
foo ""

For the sake of having something to play with, these dedenting rules are based directly on JavaScript's dedent proposal, and there's an interactive playground you can use to test out any corner cases.

If you find some inconsistency between the KDL spec and what the JS spec are describing, the KDL spec should be updated to conform to JS', since I trust them to have thought through the corner cases better than me.

IceDragon200 commented 8 months ago

For the sake of having something to play with, these dedenting rules are based directly on JavaScript's dedent proposal, and there's an interactive playground you can use to test out any corner cases.

We need to keep in mind that proposal is intended to fix a syntactical shortcoming of JS, that is, the lack of a proper heredoc.

They even list ruby's heredoc <<~ which does the exact thing I expected: https://github.com/tc39/proposal-string-dedent#in-other-languages

Even elixir's heredoc (which I count as the golden standard for its consistent behaviour):

node = """
    ABC
  """

# "  ABC\n"

In this case, the final """ denotes the strings root identation level, the equivalent in KDL would be:

node "
    ABC
  "

// Which should be this if it followed that same ruling 
node "  ABC\n"

Since KDL is trimming based on the content of the string, I can see it trimming actual intended indents:

node "
            This string is intended to have its indents preserved
       Since all of the text should be centered like this line and others        
    It's an unusual test case for sure, but it's something that could happen
             The the leading whitespace on each line is intended.
           But the current implementation will end up truncating it.
"

// Would end up like this using JS dedent
node "
        This string is intended to have its indents preserved
   Since all of the text should be centered like this line and others        
It's an unusual test case for sure, but it's something that could happen
         The the leading whitespace on each line is intended.
       But the current implementation will end up truncating it.
"

On the topic of newline shenanigans, I'd argue as long as both quotes are present, any oddball combination should theoretically work:

// These should resolve to the same thing
node "a
"
node "a\n"

// These are the same because of the multiline rule, trimming the immediate newline
node "
"
node ""

// These should be valid (there is a literal space after the quote followed by a newline)
// By the same rule that disallowed node "a\n", this is allowed
node " 
"
node "\s\n"

// This should be illegal, at least based on the multiline structure, but looks fine technically
node "
a"
node "\na"

// But this should be okay? (there is a space after the first quote)
node " 
a"
node "\s\na"

// Should be fine
node "\n
a"
node "\n\na"

I'm sure there are other oddball scenarios out there, but all of those look valid enough to me.

zkat commented 8 months ago

I actually kinda like the idea of determining overall indentation by the indentation of the last ", although that means it should be a syntax error to have something like this:

foo "
bar
  "

Otherwise I really don't like our alternatives for dealing with that. It should just be invalid. But communicating this constraint might feel really restrictive and confusing. That said, I understand the desire to have the entire string indented.

As far as your other examples go, I think the "multi-line strings MUST be " + newline" is an important constraint because it clarifies that we are now in multi-line MODE, because it has special logic to it. I feel the same about having the closing " in its own line.

Most other languages deal with this by using special syntax for their "heredocs", but this is a way for KDL to reuse existing string syntax.

Bannerets commented 8 months ago
  • All strings in KDL are multi-line, and raw strings are written with Rust-style syntax (r"foo"), instead of backticks.

Looks like this line in the README should also be changed to use the new raw string syntax.

IceDragon200 commented 8 months ago

Test case unusual_bare_id_chars_in_quoted_id.kdl needs updating as it still contains a # which is considered illegal

IceDragon200 commented 8 months ago

Unicode test for RLM and LRM are present, but the specific characters are not mentioned in the linked reference: https://www.w3.org/International/questions/qa-bidi-unicode-controls, and not references in the spec itself.

tl;dr We need to add U+200F and U+200E to the list of disallowed characters

IceDragon200 commented 8 months ago

Invalid/Incorrect Tests

Newlines

Some expected tests have extraneous newlines or not enough newlines at the end of the file:

zkat commented 8 months ago

@IceDragon200 LRM/RLM are listed in a different section, and there were already tests for it. I've added them to the spec, though, that was an oversight.

Re: incorrect tests, thanks for the review! I'll go fix all these now!

IceDragon200 commented 8 months ago

Tests with discussions

Mismatch Tests

Impossible Tests?

Old Keyword is now bare identifier

The following tests are now all valid, as the identifiers here are no longer reserved

Multiline String Tests

Oddities with node /wo children

These tests drop their brackets on output, not sure if that's intended

It might just be kuddle, but the presence of the brackets denote the possibility of children, vs without [^1].

// Implicit
node
// Explicit
node { }

Should this be valid?

The comment would be dropped during parsing

[{:term, "node"}, :space, :(, {:term, "type"}, :), {:term, "10"}] #=> node (type)10



[^1]: In respect to kuddle, nodes with or without brackets are maintained, there is a difference between a node with no implicit children (`children: nil`) and a node with no explicit children (`children: []`)
tabatkins commented 7 months ago

@IceDragon200 Re your comment about several tests

IceDragon200 commented 7 months ago

@tabatkins One more issue, how is the dot_zero (.0) being restricted:

identifier := string | bare-identifier
bare-identifier := (unambiguous-ident - boolean - 'null') | numberish-ident
unambiguous-ident := (identifier-char - digit - sign) identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
identifier-char := unicode - line-space - [\\/(){};\[\]="#] - disallowed-literal-code-points

sign := '+' | '-'

Because looking over that, .0 is legal (it falls under unambiguous-ident

[{:identifier-char, "."}, {:digit, "0"}] 
tabatkins commented 7 months ago

Per #366, I've made .1/etc illegal as idents. I've also done some minor editorial rearranging of the grammar so that it matches the spec concepts a little better.

IceDragon200 commented 7 months ago

@tabatkins Any tests for the new .0 restrictions (including prefixes)?

tabatkins commented 7 months ago

I didn't add any, but it'll definitely need some. One sec.