otac0n / Pegasus

A PEG parser generator for .NET that integrates with MSBuild and Visual Studio.
http://otac0n.com/Pegasus/
MIT License
205 stars 38 forks source link

Left-recursion allowed in circumstances that will always lead to a stack overflow. #54

Open otac0n opened 10 years ago

otac0n commented 10 years ago
expr -memoize
  = #STATE{ state["any"] = 1; } expr "a"
  / "b"

Since this changes the state of the cursor just before recursing into expr, and since memoization is used as an implementation detail of left-recursion, this alters the state of the left-recursion algorithm.

That's bad.

Basically, any left recursive call that can be preceded by a state change is a no-no.

Need to add a detector and a compiler error for this.

otac0n commented 10 years ago

I'm not sure if this is strictly computable, but my hunch is that it is.

Here's my half-baked algorithm for figuring this out:

  1. Build a control flow graph of the entire grammar.
  2. Find all nodes relating to #STATE expressions in the graph.
  3. Find all succeeding nodes from those state nodes.
  4. Mark these nodes as being reached by the aforementioned #STATE expression.
  5. Repeat the following process until no reachable nodes can be added:
    1. Find any zero width node that is reached by a #STATE expression
    2. Find the succeeding nodes from those nodes
    3. Mark these nodes as being reached by the #STATE expressions
  6. For each left-recursive rule
    1. Find all paths* through the control flow graph that reach a name node referring to the rule.
    2. Filter these paths to just those that contain only zero-width expressions
    3. Filter these paths to just those that end with a node that is reachable by any #STATE expression
    4. The remaining paths are guaranteed to stack overflow (unless the programmer adds code expressions to prevent it).

The definition of *all paths above is dubious, since there could easily be infinitely many paths through the graph. It may be sufficient (since we are talking about zero-width expressions) to use all strongly connected zero-width paths that touch a rule starting node and touch a name expression for that same rule.

In any case, this should only ever be a compiler warning, since code expressions can technically avoid it, but the workbench needs to be hardened against stack overflows.

otac0n commented 10 years ago

Oh yeah, and here is possibly the most mind-bending error case:

expr -memoize
  = expr expr "foo"
  / #STATE{ state["bar"] = 1; }