mdaines / grammophone

A tool for analyzing and transforming context-free grammars.
https://mdaines.github.io/grammophone
MIT License
200 stars 23 forks source link

Support {read, include}-related diagnostics. #33

Closed modulovalue closed 3 months ago

modulovalue commented 11 months ago

There are two theorems that can be used for detecting whether a grammar can't be LR(k) for any k.

One uses the includes relation in the DeRemer and Pennello approach: https://github.com/mdaines/grammophone/issues/26#issuecomment-1657452012

The other one uses the reads relation in the DeRemer and Pennello approach: https://github.com/mdaines/grammophone/issues/26#issuecomment-1657521978

Despite everybody stating the opposite, the DeRemer and Pennello algorithm is very simple. one needs to just collect a bunch of facts while traversing the grammar, construct a graph from those facts, find the SCCs and a topological sorting of that graph (the Tarjan SCC finding algorithm does both at the same time), and propagate lookaheads through the graph until fix points have been reached.

Intuitively, it looks to me like, those theorems can be restated by saying that if multiple paths in the grammar flow graph share the same nodes, then the grammar is known not to be LR(k) because multiple nodes can end up on the same path.

In any case, I think it would be useful if grammophone could one day support visualizing these diagnostics.

~One challenge with doing that would be that graphviz doesn't seem to support drawing edges from an edge to another edge. The DeRemer and Pennello approach works on transitions of the DFA, not on states of the DFA.~

~Here's how they visualize the includes relation in their paper:~

Bildschirm­foto 2023-07-31 um 17 09 05

~I guess one could simulate that with graphviz by turning each edge into 2 edges connected by an invisible node, but the practicality of that approach would have to be investigated.~

mdaines commented 11 months ago

I suppose you could fake the diagram like this:

digraph {

  start -> A [dir=none]
  A -> x1

  start -> b [dir=none]
  b -> x2
  x2 -> B [dir=none]
  B -> x3

  B -> A [style=dashed]

  A [shape=point, xlabel="A"]
  b [shape=point, xlabel="b"]
  B [shape=point, xlabel="B"]

  x1 [shape=circle, label=""]
  x2 [shape=circle, label=""]
  x3 [shape=circle, label=""]

  {
    rank=same
    start
    A
    x1
  }

  {
    rank=same

    b
    x2
    B
    x3
  }

}
mdaines commented 11 months ago

I meant to add an SVG rendering of that:

Unknown

modulovalue commented 11 months ago

I love that xlabel attribute trick in your example, thank you for sharing, I haven't seen that one before.

mdaines commented 11 months ago

In general, I like the idea of adding more "diagnostics". I was experimenting with something simpler recently -- adding a diagram that shows the relation used to find unreachable nonterminals. I have some ideas to improve that sort of thing, because it's sort of a pain right now.

What I think I'd like to do is structure diagnostics as plugins, with a default set, and provide a way to load more within Grammophone, kind of like the Babel REPL does.

modulovalue commented 11 months ago

[...] adding a diagram that shows the relation used to find unreachable nonterminals. I have some ideas to improve that sort of thing,

I'm currently handling this by looking at the CCs of just the context free grammar. If there's only one CC, then there are no unreachable nonterminals because somebody refers to it and any CC that does not contain the start symbol is unreachable. I definitely think that this would be a good diagnostic to have.

Edit: correction, of course, I meant to say connected components, not strongly connected components.

modulovalue commented 11 months ago

Here's a manually annotated LR(0) automaton with the includes relations from the issue description:

Grammophone grammar

Bildschirm­foto 2023-07-31 um 21 24 06

In green, the includes relation, in blue, the nontrivial includes SCC, in red, a nonempty "direct-read" that turns the SCC into a "bad" SCC which turns this grammar non-LR(k) for any k. As stated in the paper, removing the B -> c C f production removes the "direct-read" on f and repairs, but changes, the grammar.

modulovalue commented 11 months ago

A different idea: adding edges between edges to the SVG could be a post-processing step after graphviz has rendered the SVG. The last time I checked, the SVGs rendered by graphviz can contain ids for some elements if ids have been given in the source dot file, parsing the SVG and extracting coordinates should then not be too hard.

modulovalue commented 11 months ago

And, to get a better picture of what read edges might look like in grammophone, here's the read annotated LR(0) automaton (See https://github.com/mdaines/grammophone/issues/26#issuecomment-1657521978 for the grammar).

Bildschirm­foto 2023-07-31 um 22 44 21 Bildschirm­foto 2023-07-31 um 22 39 26

In green, the read relation (not to be confused with what DeRemer calls "Read" with a capital R), and in blue, the SCC that the read relation induces.

The nontrivial SCC here is enough to determine that the grammar is not LR(k) for any k, nothing else needs to be considered.


Note that making any of the nullable nonterminals not nullable by, for example, making them describe a single terminal "x", the SCC can be broken and the grammar is "restored".

There is a novel systematic way for removing nullable nonterminals, known as the CPS-transform introduced by langcc (https://github.com/mdaines/grammophone/issues/29) applying it to the grammar above makes it even SLR(1).


Edit: things appear to become much simpler by having the read relation point from DFAState to DFAState instead of transition to transition.

modulovalue commented 11 months ago

I'm investigating a POC for https://github.com/mdaines/grammophone/issues/33#issuecomment-1659055438

this results in an SVG that has the following shape:

...<g id="foo" class="edge">...

that is, edges can be accurately found by going through all g elements within the svg element of an SVG.

I think, a naive approach of drawing straight lines would be useful, even if it is not pretty.

I'm going to give this a go, highlighting SCCs would also be pretty simple, a translucent polygon could be added to highlight all edges of an SCC.

I conjecture that the theorems that DeRemer introduced will become relevant when going to LALR(k) where k is greater than 1. They don't seem to be all that useful when k is 1.

modulovalue commented 11 months ago

for future reference:

steveroush pointed out in https://gitlab.com/graphviz/graphviz/-/issues/1656#note_1496402295 that a group attribute can be used instead of ranks to put nodes into a single row:

digraph {
  rankdir="LR"
  {node[group=Aa] start; A; x1; }
  {node[group=Bb] b; x2; B; x3; }
  start -> A [dir=none]
  A -> x1
  start -> b [dir=none]
  b -> x2
  x2 -> B [dir=none]
  B -> x3
  B -> A [style=dashed]
  A [shape=point, xlabel="A"]
  b [shape=point, xlabel="b"]
  B [shape=point, xlabel="B"]
  x1 [shape=circle, label=""]
  x2 [shape=circle, label=""]
  x3 [shape=circle, label=""]
}

Note: the subgraphs containing node[group=...] must come before the nodes are being declared.

Which results in the following graph:

Bildschirm­foto 2023-08-02 um 17 32 35
modulovalue commented 11 months ago

(context: https://github.com/mdaines/grammophone/issues/33#issuecomment-1659161152)

POC for adding custom edges to an exported SVG:

Bildschirm­foto 2023-08-07 um 18 49 15

https://github.com/mdaines/grammophone/issues/33#issuecomment-1660600675 is definitely doable.

modulovalue commented 11 months ago

I've been trying a bunch of things, and I have come to the conclusion that showing the DFA + all the constraints of the DeRemer & Pennello algorithm for constructing LALR(1) lookaheads is not yet feasible in a form that is practical, and not useful enough to be worth solving.

I'm going to close this issue for now. Hopefully someday things will change, and there will be a more flexible library for doing things like that.

modulovalue commented 7 months ago

I'm working on a new algorithm for upgrading LALR(1) automata to LR(1) and the DeRemer read and include relations appear to be pretty fundamental to the act of building LR(1) automata efficiently.

This issue focused on augmenting existing automata visualizations with read and include sets. This is hard to visualize and while it would be useful, it probably is still not worth the effort.

However, I think that visualizing all LALR relations (while ignoring DFA-related structural details) and providing metrics related to read and include sets is very valuable. They capture precisely why lookaheads exist, which is important for debugging purposes (to answer the question: why does X have lookahead Y?). I'm reopening this issue for that reason.

modulovalue commented 3 months ago

I no longer agree with the statement that pursuing this would be worthwhile. While this would be nice to have, it would be much more useful to actually show ambiguities that LALR(k) can't resolve and that approach would supersede this while requiring a similar amount of effort.