mdaines / grammophone

A tool for analyzing and transforming context-free grammars.
https://mdaines.github.io/grammophone
MIT License
200 stars 23 forks source link

support EBNF #50

Open VladimirAlexiev opened 3 months ago

VladimirAlexiev commented 3 months ago

I want to analyze this grammar for complexity, it's in EBNF: https://github.com/VladimirAlexiev/shacl/blob/shaclc-grammars/shacl-compact-syntax/grammar/shaclc-XText.ebnf. It's also available in Eclipse XText: https://github.com/VladimirAlexiev/shacl/blob/shaclc-grammars/shacl-compact-syntax/grammar/shaclc-XText.xtext.

Is it possible for Grammophone to support one of these formats?

mdaines commented 3 months ago

I'd like Grammophone to do this. I think the obvious solution is to convert EBNF to BNF just after parsing by adding rules to the grammar. However, when considering this before, I wasn't sure how to present calculations with these additional rules.

For example, this grammar:

S -> x y* .

...could be converted internally to something like this:

S -> x S_ .
S_ -> y S_ | .

This would add S_ to the nonterminals table, parsing tables, etc. Maybe that's OK, but I was concerned about the results of the conversion being difficult to follow, especially for a more complicated grammar.

Perhaps the added nonterminals could be displayed differently, or the conversion to BNF could be shown explicitly somehow.

modulovalue commented 3 months ago

There's one huge issue that prevents a traditional EBNF grammar from being straightforwardly-useful with grammophone. The conversion between CFG and EBNF is not "lossless". There's some amount of ambiguity there as there are alternative interpretation strategies for the constructs that EBNF comes with.

For example, repetition operators can be interpreted to either mean a left recursive or a right recursive version.

Left recursive: on grammophone

Start -> S.
S -> a.
S -> S a.

Right recursive: on grammophone

Start -> S.
S -> a.
S -> a S.

Both describe the same language. Which one should be used?

There are more implementations possible, see this comment: https://github.com/mdaines/grammophone/issues/26#issuecomment-1716641044

There's no clear winner there so it's not clear which implementation should be used as each has the ability to offer different benefits wrt to the conflict and lookahead profile of a CFG.


Here are two separate solutions to the actual issue (which is to support a more concise grammar formalism)

  1. the "Parametrization" approach
Start -> StarRightRec(a).
StarRightRec(X) -> X.
StarRightRec(X) -> X S.

Which would make it possible to define ones own constructs.

  1. the "Native Rules" approach:
Start -> *rr(a).

where *rr/*lr is desugared into a fixed right recursive/left recursive form behind the scenes. The "semantics" of the star operator would be fixed and users of grammophone would precisely know what *rr or *lr is converted to.

I think the second approach is a much better fit for grammophone. The first one is very convenient, but much harder to implement.

Extra: 3. The "cutting-edge-infinite-resources" approach.

Grammophone could use e-graphs (https://egraphs-good.github.io) for finding the best version that leads to the fewest conflicts. This has not been done before and it is much harder, but I hope one day we'll have something like that.

modulovalue commented 3 months ago

However, when considering this before, I wasn't sure how to present calculations with these additional rules.

This would add S_ to the nonterminals table, parsing tables, etc. Maybe that's OK, but I was concerned about the results of the conversion being difficult to follow, especially for a more complicated grammar.

I think this could be achieved by taking a dataflow approach (either via DeRemer & Penello or via Pingali #27) and introducing unique nodes for the extended constructs. (Edit: Pingali discusses how a dataflow approach over a GFG can be used to find lookaheads.)

Regular Expression-to-NFA compilers can find firstsets/followsets just from the regular expression itself without desugaring those constructs into a CFG. I think this should also be doable on the CFG level.

mdaines commented 3 months ago

For example, repetition operators can be interpreted to either mean a left recursive or a right recursive version.

Of course, somehow I forgot about that...

I think the ambiguity has a benefit, though. Grammophone could choose an arbitrary interpretation but also allow the user to select one interactively (ie, not using syntax). This would allow for demonstrating the differences between the two.

So, given the example grammar:

S -> x y* .

Grammophone could indicate that y* is interpreted like this:

y* => S_ -> y S_ | .

(Let's say => means "the EBNF construct is interpreted as..." or something.)

But also allow the user to select this:

y* => S_ -> S_ y | .

I'm not sure where this would be presented. Maybe above the "Sanity Checks" section.