osa1 / parsegen

An LR parser generator, implemented as a proc macro

A simple (?) way of generating LALR(1) #8

Open osa1 opened 1 year ago

osa1 commented 1 year ago

Here's an idea that I think will generate LALR(1) parsers from LR(0) states in a simple way. This idea may be similar to the first step of lane table methods for generating LR(1) parsers from LR(0) states (e.g. phase 1 in The Lane Table Method Of Constructing LR(1) parsers) or may even be identical. I can't tell because my attention span was not enough to decipher that paper.

We start with the LR(0) automaton. A reduce/reduce conflict is when, in a state, we see two or more items of the form [T1 -> ... |], [T2 -> ... |] (T1 and T2 may be the same). A shift/reduce conflict is when, in a state, we see an item of the form [T1 -> ...|] (i.e. a possible reduction) and another item of the form [T2 -> ... | t ...] (i.e. a possible shift, where t is a terminal; T1 and T2 may be the same).

In these cases we want to know the lookaheads of the reduction items, as that will potentially allow us to resolve some of the conflicts. For example, if we have a reduction that only makes sense when the next token is t1 (e.g. the LR(1) item [T1 -> ...|, t1]) and a shift of another terminal in the same state (e.g. [T2 -> ... | t2 ...], where t2 is not the same as t1), then this is not a shift/reduce conflict anymore. Similarly for reduce/reduce conflicts: when the lookahead sets are disjoint, the conflict is resolved.
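A minimal sketch of how these two conflict patterns could be detected in a single state, assuming a toy Item/Symbol representation (not parsegen's actual types):

```rust
// Toy representation of LR(0) items; illustrative only, not parsegen's types.
#[derive(Clone, Copy)]
enum Symbol {
    Terminal(char),
    NonTerminal(&'static str),
}

struct Item {
    lhs: &'static str, // T1 in [T1 -> ... | ...]
    rhs: Vec<Symbol>,
    dot: usize, // position of `|` in the right-hand side
}

impl Item {
    fn is_reduce(&self) -> bool {
        self.dot == self.rhs.len() // [T -> ... |]
    }
    fn shifts_terminal(&self) -> bool {
        matches!(self.rhs.get(self.dot), Some(Symbol::Terminal(_))) // [T -> ... | t ...]
    }
}

/// Returns (has shift/reduce conflict, has reduce/reduce conflict)
/// for one state, i.e. one set of LR(0) items.
fn conflicts(state: &[Item]) -> (bool, bool) {
    let reduces = state.iter().filter(|i| i.is_reduce()).count();
    let shifts = state.iter().any(|i| i.shifts_terminal());
    (reduces >= 1 && shifts, reduces >= 2)
}
```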

How do we generate these lookaheads for the reduce items that are involved in a shift/reduce or reduce/reduce conflict?

Let's consider how the lookaheads of LR(1) reduce items are generated. An item is created in one of two ways: (1) by moving the dot over a symbol when generating a successor state, or (2) by expanding a non-terminal after the dot when computing the closure of a state.

In LR(1), an item generated with step (1) will have the same lookahead as the "parent" item. (code)

In step (2), the lookahead of the new item will be the "first" set of the parent item's continuation after the expanded non-terminal. Note that when computing this "first" set, we add the parent item's lookahead to the end of the item's continuation. E.g. in [T1 -> ... | <rest>, x] the "continuation" is <rest> x. (code)
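As a sketch, step (2)'s lookahead computation could look like this, assuming FIRST sets and nullability are precomputed (all names and types here are illustrative, not parsegen's):

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Clone, Copy)]
enum Symbol {
    T(char),          // terminal
    NT(&'static str), // non-terminal
}

// Lookaheads of a closure-generated item: the "first" set of the parent
// item's continuation, with the parent's lookahead appended at the end.
fn closure_lookaheads(
    continuation: &[Symbol],                       // <rest> in [T1 -> ... | <rest>, x]
    parent_lookahead: char,                        // x
    first: &HashMap<&'static str, BTreeSet<char>>, // precomputed FIRST sets
    nullable: &HashMap<&'static str, bool>,        // can derive the empty string?
) -> BTreeSet<char> {
    let mut result = BTreeSet::new();
    for sym in continuation {
        match *sym {
            Symbol::T(t) => {
                result.insert(t);
                return result; // a terminal is never nullable, so we stop here
            }
            Symbol::NT(nt) => {
                result.extend(&first[nt]);
                if !nullable[nt] {
                    return result;
                }
            }
        }
    }
    // The whole continuation can derive the empty string, so the parent's
    // lookahead x becomes visible.
    result.insert(parent_lookahead);
    result
}
```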

So for each item, we know its dependencies: the item(s) it was generated from.

It's easy to record the parent item of an item when generating it. (TODO: There should be cycles in some cases, find an example?)

With this dependency graph, when we want to know the lookahead of a reduction item (i.e. when it's involved in a shift/reduce or reduce/reduce conflict), we go backwards through the graph and try to obtain its lookahead.

I think we might want to add lookaheads to items as we visit them during this recursive process, as the lookahead of an item may be needed to resolve multiple conflicts. (TODO: Find an example?)

Note that this recursive process will return a set of lookaheads, not just a single lookahead. Also, because the dependency graph may have cycles, we need to keep track of visited items to avoid looping. The lookahead set we accumulate starts out as {}, and at the end of the process it should be non-empty.
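A sketch of this backward walk, assuming every item records the items it was generated from and the lookaheads it contributes directly (e.g. from the "first" computation above); the representation is hypothetical:

```rust
use std::collections::{BTreeSet, HashSet};

type ItemId = usize;

struct ItemNode {
    parents: Vec<ItemId>, // the items this item was generated from
    own: BTreeSet<char>,  // lookaheads contributed directly (step (2))
}

// Collect the lookahead set of `item` by walking the dependency graph
// backwards. `acc` starts out as {} and should be non-empty at the end.
fn collect_lookaheads(
    item: ItemId,
    graph: &[ItemNode],
    visited: &mut HashSet<ItemId>,
    acc: &mut BTreeSet<char>,
) {
    if !visited.insert(item) {
        return; // already visited: the graph may have cycles
    }
    acc.extend(&graph[item].own);
    for &parent in &graph[item].parents {
        collect_lookaheads(parent, graph, visited, acc);
    }
}
```

Caching the result per item (as suggested above) would avoid redoing this walk when the same item's lookaheads are needed for multiple conflicts.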

That's it. I think this is probably the same as phase 1 of the lane table method. Since the number of states will be the same as the number of LR(0) sets of the grammar, this algorithm should give us LALR(1). Phase 2 of the lane table method is about splitting some of the states/items to recover LR(1).

One of the things that I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).

osa1 commented 1 year ago

http://smallcultfollowing.com/babysteps/blog/2017/03/17/the-lane-table-algorithm/ describes the lane table algorithm of the paper referenced above, and from a quick look, it seems like phase 1 (the LALR(1) generator) is basically the same as the algorithm I describe above.

osa1 commented 1 year ago

I think to understand phase 2 of the lane table algorithm it may be helpful to look at a grammar that is LR(1) but not LALR(1), so that it cannot be handled by phase 1 without phase 2.

The blog post referenced above has this grammar, which is LR(1) and not LALR(1) (I think it's the grammar in example 3 of the paper):

G1 = "a" X "d"
   | "a" Y "c"
   | "b" X "c"
   | "b" Y "d"
X  = "e" X
   | "e"
Y  = "e" Y
   | "e"

If I feed this into parsegen this is the LR(0) automaton for it:

0: {
  [Entry1 -> |a X d]
  [Entry1 -> |a Y c]
  [Entry1 -> |b Y d]
  [Entry1 -> |b X c]
  [Entry -> |Entry1]
  Entry1 -> 1
  a -> 2
  b -> 3
}
1: {
  [Entry -> Entry1|]
}
2: {
  [Entry1 -> a |X d]
  [Entry1 -> a |Y c]
  [X -> |e X]
  [X -> |e]
  [Y -> |e Y]
  [Y -> |e]
  Y -> 12
  e -> 4
  X -> 11
}
3: {
  [Entry1 -> b |Y d]
  [Entry1 -> b |X c]
  [X -> |e X]
  [X -> |e]
  [Y -> |e Y]
  [Y -> |e]
  Y -> 6
  e -> 4
  X -> 5
}
4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|]
  X -> 10
  e -> 4
  Y -> 9
}
5: {
  [Entry1 -> b X |c]
  c -> 8
}
6: {
  [Entry1 -> b Y |d]
  d -> 7
}
7: {
  [Entry1 -> b Y d|]
}
8: {
  [Entry1 -> b X c|]
}
9: {
  [Y -> e Y|]
}
10: {
  [X -> e X|]
}
11: {
  [Entry1 -> a X |d]
  d -> 14
}
12: {
  [Entry1 -> a Y |c]
  c -> 13
}
13: {
  [Entry1 -> a Y c|]
}
14: {
  [Entry1 -> a X d|]
}

State 4 has shift/reduce and reduce/reduce conflicts, so we want to know the lookaheads of the reducible items in state 4, as those will potentially resolve the conflicts.

Let's consider one of the items in state 4: [Y -> e|].

This item is generated by a shift of e, so we want to know which item(s) in which state(s) generated this shift. We have two (state, item) pairs for it: the closure item [Y -> |e] in state 2, generated from [Entry1 -> a |Y c], and the closure item [Y -> |e] in state 3, generated from [Entry1 -> b |Y d].

The first item generates lookahead c (the "first" set of the continuation after Y). The second item generates lookahead d.

Similarly we find the lookaheads for the other reducible item in state 4, which is [X -> e|], and the lookaheads are again c and d.

So state 4 with lookaheads is:

4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|, {c, d}]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|, {c, d}]
  X -> 10
  e -> 4
  Y -> 9
}

Shift/reduce conflicts are resolved, but we still have a reduce/reduce conflict. So our algorithm (which I think should be LALR(1)) cannot handle this grammar.

Now let's look at the LR(1) automaton for this grammar to see how this case is handled:

0: {
  [Entry1 -> |a X d, $]
  [Entry1 -> |a Y c, $]
  [Entry1 -> |b Y d, $]
  [Entry1 -> |b X c, $]
  [Entry -> |Entry1, $]
  GOTO Entry1 -> 1
  GOTO a -> 2
  GOTO b -> 3
}
1: {
  [Entry -> Entry1|, $]
}
2: {
  [Entry1 -> a |X d, $]
  [Entry1 -> a |Y c, $]
  [X -> |e X, d]
  [X -> |e, d]
  [Y -> |e Y, c]
  [Y -> |e, c]
  GOTO e -> 6
  GOTO X -> 4
  GOTO Y -> 5
}
3: {
  [Entry1 -> b |Y d, $]
  [Entry1 -> b |X c, $]
  [X -> |e X, c]
  [X -> |e, c]
  [Y -> |e Y, d]
  [Y -> |e, d]
  GOTO e -> 9
  GOTO X -> 7
  GOTO Y -> 8
}
4: {
  [Entry1 -> a X |d, $]
  GOTO d -> 10
}
5: {
  [Entry1 -> a Y |c, $]
  GOTO c -> 11
}
6: {
  [X -> |e X, d]
  [X -> e |X, d]
  [X -> |e, d]
  [X -> e|, d]
  [Y -> |e Y, c]
  [Y -> e |Y, c]
  [Y -> |e, c]
  [Y -> e|, c]
  GOTO e -> 6
  GOTO X -> 12
  GOTO Y -> 13
}
7: {
  [Entry1 -> b X |c, $]
  GOTO c -> 14
}
8: {
  [Entry1 -> b Y |d, $]
  GOTO d -> 15
}
9: {
  [X -> |e X, c]
  [X -> e |X, c]
  [X -> |e, c]
  [X -> e|, c]
  [Y -> |e Y, d]
  [Y -> e |Y, d]
  [Y -> |e, d]
  [Y -> e|, d]
  GOTO e -> 9
  GOTO X -> 16
  GOTO Y -> 17
}
10: {
  [Entry1 -> a X d|, $]
}
11: {
  [Entry1 -> a Y c|, $]
}
12: {
  [X -> e X|, d]
}
13: {
  [Y -> e Y|, c]
}
14: {
  [Entry1 -> b X c|, $]
}
15: {
  [Entry1 -> b Y d|, $]
}
16: {
  [X -> e X|, c]
}
17: {
  [Y -> e Y|, d]
}

(To be continued...)

osa1 commented 1 year ago

To see the difference, let's consider an input that will cause the LALR(1) version to get stuck: "aed".

The LALR(1) states will go like this:

states: [0]
symbols: []

==> a

states: [0, 2]
symbols: [a]

==> e

states: [0, 2, 4]
symbols: [a, e]

At this point we're stuck, because state 4 has two possible reductions when the next symbol is "d".

Now let's see how the LR(1) version handles it:

states: [0]
symbols: []

==> a

states: [0, 2]
symbols: [a]

==> e

states: [0, 2, 6]
symbols: [a, e]

==> d

states: [0, 2, 4, 10]
symbols: [a, X, d]

So after the prefix "ae" we get stuck in the LALR(1) version, but we can make progress in the LR(1) version.

The problem seems to be that the LALR(1) state after the prefix "ae"

4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|, {c, d}]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|, {c, d}]
}

is the "union" of these two states in the LR(1) version:

6: {
  [X -> |e X, d]
  [X -> e |X, d]
  [X -> |e, d]
  [X -> e|, d]
  [Y -> |e Y, c]
  [Y -> e |Y, c]
  [Y -> |e, c]
  [Y -> e|, c]
}

9: {
  [X -> |e X, c]
  [X -> e |X, c]
  [X -> |e, c]
  [X -> e|, c]
  [Y -> |e Y, d]
  [Y -> e |Y, d]
  [Y -> |e, d]
  [Y -> e|, d]
}

Note that these two states are identical if we remove the lookaheads.

In the LR(1) version, keeping track of lookaheads even in shiftable items, and considering items different when their lookaheads differ, gives us more precise tracking of context (i.e. how we ended up in a state), which allows us to avoid some of the conflicts that the LALR(1) version has. In the LALR(1) version, the inputs "ae" and "be" lead us to the same state (state 4). In the LR(1) version, if we see "a", we keep track of the fact that we can only reduce to X when the lookahead is d and to Y when the lookahead is c, and similarly when we start with "b".

So if we want to turn our LALR(1) parser into an LR(1) one, we need to somehow split some of the states into multiple states to allow maintaining lookaheads based on how we end up in a state.

This is what the second phase of the lane table algorithm does, but I still don't understand how it does this.

osa1 commented 1 year ago

I think I started to understand phase 2 of the lane table algorithm.

Let's consider this state in our LALR(1) states:

4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|, {c, d}]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|, {c, d}]
}

We want to resolve the reduce/reduce conflict in this state. I think the idea is this: in the algorithm described above, to generate this LALR(1) state we visit predecessors of the reduce items and try to find the lookahead symbols. As we do that, we build a table that records which state contributes which lookaheads to which reduce actions. We first give names to the reduce items:

A1 = [X -> e|]
A2 = [Y -> e|]

We have two paths from the initial state to state 4: 0 -> 2 -> 4 (shifting a, then e) and 0 -> 3 -> 4 (shifting b, then e).

So our table looks like this:

States/Actions    A1        A2
2                 d         c
3                 c         d
4                 {c, d}    {c, d}

This says that state 3 adds lookahead c to A1 ([X -> e|]) and lookahead d to A2 ([Y -> e|]), and so on.

State 0 is omitted (I think) because it doesn't contribute any lookaheads to A1 or A2. (we don't even visit state 0 in our LALR(1) algorithm)
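As a data structure, the lane table could simply be a map from (state, action) to the contributed lookahead set. A sketch, populated with the table above (names illustrative):

```rust
use std::collections::{BTreeSet, HashMap};

type StateId = usize;
type ActionId = usize; // 1 for A1 = [X -> e|], 2 for A2 = [Y -> e|]

type LaneTable = HashMap<(StateId, ActionId), BTreeSet<char>>;

fn example_table() -> LaneTable {
    let mut t = LaneTable::new();
    t.insert((2, 1), BTreeSet::from(['d']));      // state 2 contributes d to A1
    t.insert((2, 2), BTreeSet::from(['c']));      // ... and c to A2
    t.insert((3, 1), BTreeSet::from(['c']));
    t.insert((3, 2), BTreeSet::from(['d']));
    t.insert((4, 1), BTreeSet::from(['c', 'd'])); // state 4 sees the union
    t.insert((4, 2), BTreeSet::from(['c', 'd']));
    t
}
```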

Now the idea is that we want to introduce some new states so that we'll have states similar to our LR(1) states, which allowed better tracking of lookaheads and avoided the reduce/reduce conflict we have in the LALR(1) states.

For that, we start with "lanehead" states and do a depth-first traversal. (TODO: Would breadth-first work? Would it be less efficient? I think depth-first is used just because it's easier to implement with recursive calls.)

A lanehead state is a state in the table that is not reached from any other state in the table. In the table above, these are states 2 and 3.

(There will always be a lanehead state: in the worst case it will be the initial state with the item [S0 -> | S]. Since no other state can have the item [S0 -> | S] (because S0 is the special initial non-terminal), there won't be any incoming edges to the initial state, so it won't have any predecessors.)
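Under this definition, lanehead detection is just a predecessor check. A sketch, assuming the predecessor relation of the LR(0) automaton is available (illustrative):

```rust
use std::collections::{HashMap, HashSet};

type StateId = usize;

// A lanehead is a state that appears in the lane table but is not
// reached from any other state in the table.
fn laneheads(
    table_states: &HashSet<StateId>,
    preds: &HashMap<StateId, Vec<StateId>>, // predecessors in the automaton
) -> Vec<StateId> {
    table_states
        .iter()
        .copied()
        .filter(|s| match preds.get(s) {
            None => true,
            Some(ps) => ps.iter().all(|p| !table_states.contains(p)),
        })
        .collect()
}
```

For the table above this returns states 2 and 3: state 4's predecessors (2, 3, and 4 itself) are all in the table, while 2 and 3 are only reached from state 0, which isn't.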

Starting with the lanehead states, we do a depth-first traversal and collect the lookaheads. We end up with two lanes: [2, 4] with lookaheads {A1: {d}, A2: {c}}, and [3, 4] with lookaheads {A1: {c}, A2: {d}}.

When we look at the lookaheads of actions in each of these lanes, we see that none of them have reduce/reduce conflicts.

What this tells us is that if we had two lanes for [0, 2, 4] and [0, 3, 4] we wouldn't have the reduce/reduce conflicts.

Now, in the example above it's clear that we want to split state 4, say into A and B, and have two lanes [0, 2, A] and [0, 3, B].

But it's not clear to me how to split these lanes in general. I'm hoping to figure this out next.

osa1 commented 1 year ago

I think splitting states probably works like this.

Starting with the lanehead states we do a depth-first traversal again, and for each state, check if we can merge its context with its successor's.

The traversal will visit the lanes in this order: first [2, 4], then [3, 4].

In the first case, we can merge the contexts of 2 and 4 without any conflicts, so we do it and get lookaheads {A1: {d}, A2: {c}}.

Now we consider states 2 and 4 as having the same context, which is as shown above.

When we check [3, 4] we try to merge 3 and 4. Since 4 now has context {A1: {d}, A2: {c}}, merging the context of 3 into 4 causes a conflict, so we cannot merge the contexts of 3 and 4. Instead we duplicate state 4; let's call the copy 4_2.

So now the second lane is [3, 4_2], and we can merge these two to get context {A1: {c}, A2: {d}}, which again doesn't have any conflicts.
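A sketch of this merge-or-split check, under my reading of phase 2: a lane's context maps each reduce action to its lookahead set, and merging fails when two actions' sets would overlap (names illustrative):

```rust
use std::collections::{BTreeSet, HashMap};

type ActionId = usize;
type Context = HashMap<ActionId, BTreeSet<char>>;

// Try to merge context `b` into context `a`. Returns None if the merged
// context would have a reduce/reduce conflict, i.e. two different
// actions sharing a lookahead; in that case the caller duplicates the
// state (like 4 -> 4_2 above) and continues the lane with the copy.
fn try_merge(a: &Context, b: &Context) -> Option<Context> {
    let mut merged = a.clone();
    for (action, lookaheads) in b {
        merged.entry(*action).or_default().extend(lookaheads);
    }
    let sets: Vec<&BTreeSet<char>> = merged.values().collect();
    for i in 0..sets.len() {
        for j in (i + 1)..sets.len() {
            if !sets[i].is_disjoint(sets[j]) {
                return None;
            }
        }
    }
    Some(merged)
}
```

With the contexts above, merging {A1: {c}, A2: {d}} into {A1: {d}, A2: {c}} would yield {A1: {c, d}, A2: {c, d}}, whose sets overlap, so state 4 has to be duplicated.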

osa1 commented 1 year ago

Here are the LR(0) and LR(1) automaton graphs for our LR(1) grammar example:

[figure: lr0 automaton graph]

[figure: lr1 automaton graph]

modulovalue commented 1 year ago

> because the dependency graph may have cycles, we need to keep track of visited items to avoid looping.

Have you considered looking at the SCCs of the dependency graph instead of only considering cycles? As I said on Reddit, I think the relations-based algorithm is the simplest one, and I think it is exactly what you are looking for.
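For reference, a sketch of the SCC idea: collapse the item dependency graph into strongly connected components, then compute each component's lookahead set in a single pass. This assumes the graph is stored in a petgraph DiGraph whose node weights are the lookaheads an item contributes directly; it's illustrative, not parsegen code:

```rust
use std::collections::BTreeSet;

use petgraph::algo::tarjan_scc;
use petgraph::graph::DiGraph;

// Nodes are items, node weights are the lookaheads an item contributes
// directly, and an edge a -> b means "item a depends on item b".
fn scc_lookaheads(g: &DiGraph<BTreeSet<char>, ()>) -> Vec<BTreeSet<char>> {
    // tarjan_scc returns components in reverse topological order, so a
    // component's dependencies are always processed before it.
    let sccs = tarjan_scc(g);
    let mut comp_of = vec![0usize; g.node_count()];
    for (c, comp) in sccs.iter().enumerate() {
        for &n in comp {
            comp_of[n.index()] = c;
        }
    }
    let mut result: Vec<BTreeSet<char>> = vec![BTreeSet::new(); sccs.len()];
    for (c, comp) in sccs.iter().enumerate() {
        for &n in comp {
            result[c].extend(&g[n]); // the item's own lookaheads
            for m in g.neighbors(n) {
                let d = comp_of[m.index()];
                if d != c {
                    // Cross-component dependency: already complete.
                    let dep = result[d].clone();
                    result[c].extend(dep);
                }
            }
        }
    }
    result // lookahead set per component (shared by all items in it)
}
```

All items in a cycle (an SCC) necessarily share the same lookahead set, which is what the visited-set trick above computes implicitly.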

modulovalue commented 1 year ago

> One of the things that I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).

Do you know if the redundant states that I've highlighted in https://github.com/osa1/parsegen/issues/11 are also redundant when the lane table method is used for constructing LR(1)? Or are they already merged there?