kdl-org / kdl

the kdl document language specifications
https://kdl.dev
Other
1.12k stars 62 forks source link

KDL 2.0: #-prefixed identifier-based strings (and a raw string change) #350

Closed zkat closed 9 months ago

zkat commented 10 months ago

After some thought and some discussion with folks, I think a great alternative to #339 is to instead allow for quoteless strings, but only if they're prefixed with #.

So:

command {
  exec #cargo #run #-- #~/.foo/bar/baz.txt
}

Specifically, we would make the following changes:

  1. # would be banned from the identifier grammar only as the first character, but allow it anywhere else in the identifier.
  2. Any #-prefixed identifiers will be interpreted as string values, using the standard identifier grammar.
  3. Remove the r prefix from raw strings, such that raw strings are just #""#. The r is essentially unnecessary.
larsgw commented 10 months ago

And because identifiers cannot start with a #, any raw string can immediately be distinguished after the #, even if it starts with ## and not #"!

zkat commented 9 months ago

So I ran into a bit of a hitch, which is that \ and / are both banned from identifiers, and for good reason (line continuations for \ and comments for /).

It's possible to enable / use in identifiers, but it's gonna be complicated, and possibly a footgun ("/ can't be followed by *, -, or /")

Supporting \ is trickier, because it would mean \ would be expected to be a valid node, but then...

foo {
  foo #x\
  "bar"
}

Could be very surprising if you intended to put a space after x but typoed it, and I really don't like that or think it's worth it.

tabatkins commented 9 months ago

I don't think we need to worry too much about that, yeah. I'm comfortable with a slash requiring you to quote the string.

tabatkins commented 9 months ago

The big question, I think, is just whether we think "ident-strings" are more or less common than true/false/null, and which feels more natural to have unprefixed / awkward to have prefixed.

I don't have a strong opinion here, and having # be generically associated with strings is somewhat appealing, but overall I think I lean slightly towards # being an indicator of something language-specific (a keyword, or a specially-quoted string) is a little more natural. Seeing #true to indicate a boolean doesn't feel too weird, I think. foo bar=#true #null seems okay to me?

(This makes me old, but it slightly harkens to Scheme with its #t and #f bools. I do prefer them spelled out, tho.)

Meanwhile, it feels slightly better imo to write

command {
  exec cargo run -- "~/.foo/bar/baz.txt"
}

This hews closer to the original design inspiration of "look kinda like console commands" - you still have to quote strings slightly more often than you do in the console, but overall it's very similar.

Using # for keywords also means we avoid the future-compat issue of node names not being allowed to be keywords - you can't currently name a node true, for example. Adding more keywords in the future would become a backward-breaking change for any document that was innocently and validly using that node name, while adding a new #foo keyword would just be a forward-breaking change (in semver terms, adding a new #foo keyword would be a minor version bump, while adding a new foo keyword would be a major version).

tabatkins commented 9 months ago

Never mind, I argued myself into a stronger position.

It is already, in 1.0, the case that a string requires positional/contextual information to know how to interpret it. You can write "foo" "bar"="baz" "qux" and have to recognize that "foo" is a node name because it starts the node, "bar" is a property name because it's followed by an =, and "baz" and "qux" are both strings because they're neither of those cases.

Currently, 1.0 allows you to write two of those as idents, if they adhere to ident syntax: foo bar="baz" "qux". The other two must remain quoted. (Tags are a third thing that's "strings, but can be unquoted if they're idents", but they're more obviously separable from the rest syntactically. Still, tho.)

I think the most consistent and readable upgrade for 2.0 to do would be to make it generically the case that strings can be unquoted if they follow ident syntax, so foo bar=baz qux becomes valid. Then our three special keywords just violate ident syntax by starting with a #, once we ban idents from doing that.

This consistency applies the other way, too: in 1.0, "true" "false"="null" is valid and meaningful. And all three look like idents, but true false=null is invalid in two ways and changes meaning in one. With this "all idents are strings, keywords start with #, tho, true false=null would be completely valid and mean the same as "true" "false"="null". Trying to sub those with keywords would be much more obviously invalid: #true #false=#null.

zkat commented 9 months ago

So you're arguing in favor of #true/#false/#null being the only way to write our "keywords" going forward?

Honestly, I think I can get on board with that, but part of me is feeling kinda queasy about making everyone prefix what are probably very common keywords. Then again, it works for Scheme, and it would absolutely, positively, and safely give us idents-are-strings across the board, and that is really high value. It also makes keywords VERY obvious.

At this point, my only two hesitations against #true and co are:

zkat commented 9 months ago

what do you think?

marrus-sh commented 9 months ago

I do like the simplicity of :⁠—

(Previously there was no way to recognize keywords from just the first few characters; this also solves #224.)

The fallback behaviour (of true parsing as "true" and not #true) is not so bad and could be handled in a compatibility layer after parsing. I think it is good to bear in mind the Makefile situation :⁠—

Why the tab in column 1? Yacc was new, Lex was brand new. I hadn't tried either, so I figured this would be a good excuse to learn. After getting myself snarled up with my first stab at Lex, I just did something simple with the pattern newline-tab. It worked, it stayed. And then a few weeks later I had a user population of about a dozen, most of them friends, and I didn't want to screw up my embedded base. The rest, sadly, is history.

I wouldn’t worry too much about existing KDL documents when the vast majority of KDL documents are yet to be written. The bigger question is whether implementations can be updated (if major implementations remain stuck in KDL 1.0, then cross‐compatibility with that version becomes more important; this is an enduring problem with YAML 1.1 ~ 1.2 for example).

zkat commented 9 months ago

thanks for voicing this. I'm definitely settling into #true &co more comfortably now and think it might be The Right Thing in the end.

As far as implementations go, it's definitely going to be a bit of time before all the implementations are up to speed with 2.0, BUT, I don't think any of the parser changes are absolutely massive, and the reality is that as far as compatibility goes, people writing KDL documents only consume those particular documents using a single tool (because it's meant as a config language, usually just the specific tool they're configuring), and not as a data-exchange format like JSON is, for which precise cross-implementation compatibility is critical, and it doesn't have YAML's extensive external tooling that also all needs to agree, where even a single config might involve multiple different tools/implementations interacting with the same doc...

tabatkins commented 9 months ago

So you're arguing in favor of #true/#false/#null being the only way to write our "keywords" going forward?

Yeah, or something that syntactically distinguishes the keywords from the rest of the ident space, so we can settle on "idents are just a way to write simple strings" as a generic language rule. # is just a reasonable-looking way.

I wouldn’t worry too much about existing KDL documents when the vast majority of KDL documents are yet to be written. The bigger question is whether implementations can be updated (if major implementations remain stuck in KDL 1.0, then cross‐compatibility with that version becomes more important; this is an enduring problem with YAML 1.1 ~ 1.2 for example).

Yes, this is a big point. KDL is still a fledgeling; we don't want to rip up the foundations (we'd lose all the knowledge we've gained from 1.0), but we shouldn't be afraid of significant changes if they have a very compelling reason based on that 1.0 experience.

And yeah, every change so far is very trivial, at least in my parser. Some types of generated parsers might have a little trouble with some 2.0 changes, but this change in particular won't be problematic for anyone to make, I'd bet.

The fallback behaviour (of true parsing as "true" and not #true) is not so bad and could be handled in a compatibility layer after parsing.

Yup. While we, as the general language, can't/shouldn't make weird compat hacks like that, it is absolutely reasonable for individual tools to do so as part of an update, if they need to support KDL 1.0 files. The cases where a bool is expected, and a string containing "true" would also be possible and reasonable, are so incredibly rare in practice. The fact that 1.0 documents will still be valid under the 2.0 change, just with a slight change in meaning, will make this very easy to deal with downstream.

zkat commented 9 months ago

@tabatkins true/false/null are actually banned identifiers, and I think we should keep that rule, tbh. Not just because of weird compatibility stuff, but because a lot of people expect true to be a boolean and that would be definitely a footgun for learners, whereas having them be syntax errors means there's very little chance of this being a problem (you'd need two errors together for fasle to be a problem)

tabatkins commented 9 months ago

Ah, I'd actually missed that "true" is an invalid node name in 1.0 even as a string. (My impl didn't; it's correctly rejecting that!)

So you're saying node true would be invalid (not equivalent to node "true"? I'm fine with that.

zkat commented 9 months ago

alright. I'll PR these changes soon, then. It definitely seems like the best choice of everything we've discussed, with the caveat of "#true is kinda stinky and it makes me oddly uncomfortable" lol. I'm sure folks will warm up to it pretty fast.

tabatkins commented 9 months ago

Side benefit: since they're now syntactically distinct, we can go full YAML and also mint #t, #yes, etc as synonyms. ^_^

larsgw commented 9 months ago

true/false/null are actually banned identifiers

Ah, I'd actually missed that "true" is an invalid node name in 1.0 even as a string. (My impl didn't; it's correctly rejecting that!)

I don't see this in the specification. Are we saying this is an invalid document in 1.0?

"true"
tabatkins commented 9 months ago

Sorry, I was messing around in the shell and screwed up my invocation. "true" foo=1 is indeed a valid node.

(I'd written echo "true" foo=1 | kdlreformat - - and gotten an error, but that's because "true" in bash will invoke true and sub itself with the result (nothing), so I was actually passing foo=1 to my formatter; when I tried with echo "tru" foo=1 it worked, because tru isn't a valid command so it remained a string. Once I rewrote it to echo '"true" foo=1' | kdlreformat - - it worked correctly. Yay bash!)

zkat commented 9 months ago

"true" is a valid node name. true is not.

tabatkins commented 9 months ago

Right. I got myself confused, but my confusion is irrelevant here. I'm still fine with true continuing to be an invalid ident.

This does mean that 1.0 documents will be invalid as 2.0, and thus can't be handled as a post-processing layer (at least, not without the tool having a legacy support mode, which I very well might add to kdlpy). But I do like the ability to signal to the user that the true they wrote intending it to be a boolean is incorrect.

zkat commented 9 months ago

closing this in favor of #339, then.