patrickhuber / Pliant

MIT License
26 stars 4 forks source link

Direct processing of EBNF (for better diagnostics) #81

Open ArsenShnurkov opened 7 years ago

ArsenShnurkov commented 7 years ago

2006, Sheng-Jun Wang, Cheng-Zhi Jin, An extension of earley's algorithm for extended grammars https://link.springer.com/chapter/10.1007/978-1-4020-3953-9_22 EUR 24.95 (It's text is also visible through google books preview)

I don't know do you use it, or not. That's my question and possible proposition (to use).

Using this modified algorithm, it is possible to highlight rules (and symbols) in grammar (when it is defined in text file), which are affected by parsing error(s) in sample input.

patrickhuber commented 7 years ago

I purchased the paper a while back, but didn't spend time working on the implementation.

If I remember correctly, the ebnf syntax is currently faked by generating node names during the grammar creation. If you specify :

Repetition

A := A { A }

It gets turned into two grammar rules

A := A {A}
{A} := λ | A {A}

This is the same for any grouping, though different groupings ex: [optional] or (group) have different rules that are generated based on the rules for ebnf to bnf.

Optional

A:= A [B]
B:= 'b'

results in

A := A [B]
[B] := λ | B
B := 'b'

Grouping

A:= A (BC)
B:= 'b'
C: = 'c'

results in

A:= A (BC)
(BC):= BC
B:='b'
C:='c'

This all happens in the ebnf grammar generator. See the Grouping, Optional and Repetition methods. You can see its use in the tests

ArsenShnurkov commented 7 years ago

This all happens in the ebnf grammar generator.

Yes, but there is no mechanism to trace errors back to rules in original EBNF grammar text.

patrickhuber commented 6 years ago

That is true, you would have to unwind the computed grammar rule post parse during AST evaluation. I do something similar with a single pass disambiguation algorithm.

Another work around is to rewrite the extended functions out of the grammar.

Found the paper, here is the table of operators listed:

Operator Name
a Terminal
A Non Terminal
E* Zero or More
E% One or More
E1 | E2 Or
E1#E2 List of E1 seperated by E2
E1 + E2 E1 or E2 or E1E2
{E} Group elements
[E] Zero or One
E1○E2 Sequence of E1 and E2 or E1 E2 for short

The paper suggests creating a state machine to represent the progress through each extended earley item. Currently we use the DottedRule class to represent the state of the parse so that would need to change. DottedRule is a linear subset of a finite state machine currently, though replacing it would ripple through the code base.

It looks like the other change is adding the "Extend" function to the list of current (Predict, Complete and Scan). This function appears to compete with the "Complete" function for items. I'd like to see a sample implementation to see how they handle the full algorithm and parse nodes. The paper is pretty high level in detail.