Remove '#' from legal characters in a bare identifier

hkolbeck commented 2 years ago

Based on discussion here: https://github.com/kdl-org/kdl/discussions/200

zkat commented 2 years ago

So KQL current picks out two more symbols that are currently legal identifiers: ~ and +

See https://github.com/kdl-org/kdl/blob/main/QUERY-SPEC.md#selectors

Should we use those, or reserve something else?

hkolbeck commented 2 years ago

Do you for-see adding general regex support to string matchers? The natural operator seems like ~= which is perhaps another argument for removing ~. That said, I think it depends entirely on the formal grammar for KQL. If whitespace is required between identifiers and operators, then I don't think ~/+ being legal is an issue, since as far as I can tell the token in that position can only be an operator.

EDIT: Haha just kidding, I forgot a b was a legal (and common) selector. I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

hkolbeck commented 2 years ago

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

zkat commented 2 years ago

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

oh I didn't realize that. That's right!

I suppose we do have the option of breaking the css similarity and moving descendent selectors to something like a >> b.

I like this a lot, but thinking about verbosity, I think it's must more straightforward to say a b than a >> b, esp with "real" node names? I think the >> will become cumbersome? thoughts?

I'm also pretty in favor of removing ~ from identifiers and reserving it. If we do that, I can't think of anything else we might want to reserve that's important to.

hkolbeck commented 2 years ago

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

I'm in favor of removing ~ as well.

zkat commented 2 years ago

I'm also kinda wanting to bring back /?

zkat commented 2 years ago

if we do that, URLs would become legal identifiers in most cases? I'm trying to think of when they wouldn't be, but maybe "URL encodable" might be the target we want fo KDL 2.0?

zkat commented 2 years ago

https://stackoverflow.com/questions/1547899/which-characters-make-a-url-invalid/1547940#1547940

Or maybe not. That means we would need to include []();=', among others :(

hkolbeck commented 2 years ago

I feel like / is even more fraught than any of the others we've discussed. It sucks to make them force-quoted, but take

// "foo"

Is that a commented line or a node named //?

zkat commented 2 years ago

oh right I forgot about comments haha

zkat commented 2 years ago

Should we make a new branch and make this PR against that, btw? Like a 2.0 tracking branch that has 1.0-breaking changes in it?

zkat commented 2 years ago

boom, done :)

you might need to rebase onto it tho I think I did something bad.

tabatkins commented 2 years ago

+ isn't actually an issue, I don't think, since it's already not legal as the first char in a bare id.

Yes it is, so long as the second character isn't a digit; the grammar just prevents confusion with numbers. It's allowed by itself as a bare-ident.

tabatkins commented 2 years ago

I don't think >> is too cumbersome, and as someone who doesn't do much complex css, it better communicates the meaning, imo.

Fwiw, the CSSWG (aka, me, in this case) tried to add >> as an alternate way to spell the descendant combinator. It failed due to lack of impl interest, but I would have liked it (and ++ as the way to spell ~).

zkat commented 2 years ago

So how about making the final list of removed characters just # and +?

tabatkins commented 2 years ago

I'm fine with that. (We could keep + as an ident char but just remove it from initial ident chars if that would still be useful, so KQL could still tell apart + from an ident. I don't have a strong opinion either way.)

zkat commented 2 years ago

ugh but + is so useful, esp as a first char. I wonder if there's something else we could use for "sibling"/"next"?

Also, should we pre-emptively ban |? I feel like it smells like a bracket identifier to me. And it means we could use it for OR syntaxes?

zkat commented 2 years ago

after thinking about it, I don't know if we should remove any extra characters beyond #. We're removing # because it potentially really complicates some kinds of implementations, but I don't think we need to reserve anything else for the sake of the query language as long as we make one change: queries must be <matcher> <operator> [<matcher> <operator>]. That is, we no longer do descendants with spaces, and use >> for "descendants", as @tabatkins described.

I think that'll take care of it, and we can just merge this branch. What do y'all think?

tabatkins commented 2 years ago

Yeah, that sounds really good. It leaves KQL's evolution much more open without tying it so intimately to KDL's precise syntax model, and avoids restricting KDL itself unnecessarily.

(Recall that CSS's implicit descendant combinator comes from the very first version, when there weren't any combinators at all. Child/next-sibling/general-sibling are all later inventions layered on top. We don't need to repeat that.)

tabatkins commented 2 years ago

Re: # in particular, a final possibility I don't think has been brought up is just disallowing an ident from starting with r# specifically. This appears to be kdl4j's current behavior, per the conversation in #200.

It's certainly slightly clumsier than a simple blanket prohibition, but it also means we maintain maximum ident syntax without requiring excessive lookahead.

hkolbeck commented 2 years ago

I'm wavering. It's definitely clearer and probably easier to implement with parser generators to simply ban #, but I can absolutely see people wanting to use hashtag-like identifiers and it feels worthwhile to make that clean.

tabatkins commented 2 years ago

Yeah, so just outlawing the numberish or rawstringish starts are easy for hand-rolled parsers, but for the grammar itself, let's see...

// current
bare-identifier := ((identifier-char - digit - sign) identifier-char* | sign ((identifier-char - digit) identifier-char*)?) - keyword

// proposed new grammar
bare-identifier := (bare-ident-start identifier-char*) - keyword
bare-ident-start := ((identifier-char - digit) identifier-char?) - (sign digit) - "r#"

// or possibly
bare-identifier := (unambiguous-ident | numberish-ident | stringish-ident) - keyword
unambiguous-ident := (identifier-char - digit - sign - "r") identifier-char*
numberish-ident := sign ((identifier-char - digit) identifier-char*)?
stringish-ident := "r" ((identifier-char - "#") identifier-char*)?

I suspect the latter is the most compatible with parser gens, since the subtractions are single-character set subtractions, rather than sequence subtractions like the - keyword.

zkat commented 9 months ago

Note: With https://github.com/kdl-org/kdl/pull/354, and potentially with whatever happens with #350, I think we can be much less aggressive here and just ban # as the first character in identifiers, like we do with signs and digits already. I'll keep this open for now, but I'm thinking of just resolving things that way.

zkat commented 9 months ago

This is done in kdl-v2 now, so I'm closing this one. # is outright illegal now.

kdl-org / kdl

Remove '#' from legal characters in a bare identifier #204