kdl-org / kdl

the kdl document language specifications
https://kdl.dev
Other
1.09k stars 61 forks source link

(grammar) Matching decimal first prevents 0b prefix from being matched #330

Closed eugenesvk closed 1 year ago

eugenesvk commented 1 year ago

It seems that the grammar has a decimal number conflict

Let's see how this binary number should be parsed:

0b10
↑`integer` @ `decimal` since `decimal` is the 1st rule in `number`
 ↑ error since it's not `integer`, and binary starts with a digit

Excerpt from the spec grammar

number  := decimal | hex | octal | binary
decimal := sign? integer ('.' integer)? exponent?
binary  := sign? '0b' ('0' | '1') ('0' | '1' | '_')*
sign    := '+' | '-'
integer := digit (digit | '_')*

So either decimal should come last in the rules to allow for 0b to be earlier parsed as a binary, or it should exclude 0b if the first match is 0

(VSCode syntax highlighting works around by adding \b word boundary, but then word boundary might have a tricky interplay with - sign, so you'd need to shift it to the integer check, a proper grammar fix would be better)

larsgw commented 1 year ago

Would this not depend on how the grammar is implemented? Does the order matter for parsers with backtracking if there is never an ambiguous situation?

eugenesvk commented 1 year ago

Would this not depend on how the grammar is implemented?

Yes, you can disregard the grammar and do the correct thing :)

Does the order matter for parsers with backtracking if there is never an ambiguous situation?

Don't know, is it ok if the grammar is correct only under these conditions?

IceDragon200 commented 1 year ago

That is a good point, as someone who wrote their parser by hand and from scratch, I indeed matched on special forms first and then decimal, I wasn't even aware of the "grammar" until recently, I just did it based on the README.

I'd argue the order does matter, in the original grammar, it's possible to end up in an invalid state immediately:

0b0001

would possibly be parsed with "0" as a decimal and then "b0001" as an id, granted, it shouldn't because you must have some kind of space between them, but still, bad experience for the end user

larsgw commented 1 year ago

I just meant that the current grammar might technically not specify an actual order, but provide an unordered sense of alternatives (as the grammar should be unambiguous for a full file either way, I think). That's all semantics though, I have nothing against changing the order. I think it occurs in a lot more places though. Wouldn't the alternatives of node-space also be a problem without some form of backtracking? Something like (ws+ escline? | escline) ws* would work but it's a lot less clear to humans in my opinion, and it's not like the grammar is fully machine-readable anyway.

eugenesvk commented 1 year ago

Something like (ws+ escline? | escline) ws* would work but it's a lot less clear to humans in my opinion

For what purpose would those humans read the grammar where correctness is less important than readability. For example, I was reading it to create a syntax highlighter, and all these incorrect grammar rules make it harder (and readability doesn't help)

and it's not like the grammar is fully machine-readable anyway.

There's another issue for this

larsgw commented 1 year ago

Sorry, I made a wrong assumption about node-space, that does not seem to have the same issue (at least in PEG). I think the number rule is the only case where a subsequent alternative 'fits' entirely in a previous alternative (which seems to be the problem; not the first characters overlapping as I previously thought).

For what purpose would those humans read the grammar where correctness is less important than readability. For example, I was reading it to create a syntax highlighter, and all these incorrect grammar rules make it harder (and readability doesn't help)

Again, I don't think the grammar presently has an issue with "correctness". The grammar itself doesn't promise to have alternatives in any meaningful order, nor does it follow a specification that makes that promise. Converting this grammar directly to a RegExp works entirely okay. Requirements placed on grammar structure can vary greatly between parsers, and I'd argue that makes it all the more important that the grammar (formally) conveys the format clearly, without being obscured by implementation details (same as with pseudo-code). (To be clear, the change for the number rule would not obscure the format; the change to node-space would IMO.)

tabatkins commented 1 year ago

Right, the grammar doesn't specify any particular ordering semantics (first-match, longest-match, etc) so the assumption should be that it's the maximally correct parse, with unbounded backtracking if need be, and any particular ordering is an implementation issue. Any ambiguity where two distinct valid parses can be created from the same input is a bug, but if it's a choice between "parse this way and have an invalid document, or parse that way and have a valid document", the valid-document parse is the correct one.

That said, I don't see any reason not to move decimal to the end of the number production; like many others, my parser tests for decimal last as well. Since this is a non-normative change for readability, I think I can just make the tweak inline.