osa1 / parsegen

An LR parser generator, implemented as a proc macro

A simple (?) way of generating LALR(1) #8

Open osa1 opened 1 year ago

osa1 commented 1 year ago

Here's an idea that I think will generate LALR(1) parsers from LR(0) states in a simple way. This idea may be similar to the first step of lane table methods for generating LR(1) parsers from LR(0) states (e.g. phase 1 in The Lane Table Method Of Constructing LR(1) parsers) or may even be identical. I can't tell because my attention span was not enough to decipher that paper.

We start with the LR(0) automaton. A reduce/reduce conflict is when, in a state, we see two or more items of the form [T1 -> ... |], [T2 -> ... |] (T1 and T2 may be the same). A shift/reduce conflict is when, in a state, we see an item of the form [T1 -> ...|] (i.e. a possible reduction) and another item of the form [T2 -> ... | t ...] (i.e. a possible shift, where t is a terminal; T1 and T2 may be the same).

In these cases we want to know the lookaheads of the reduction items, as that will potentially allow us to resolve some of the conflicts. For example, if we have a reduction that only makes sense when the next token is t1 (e.g. the LR(1) item [T1 -> ...|, t1]) and a shift of another terminal in the same state (e.g. [T2 -> ... | t2 ...], where t2 is not the same as t1), then this is not a shift/reduce conflict anymore. Similarly for reduce/reduce conflicts: when the lookahead sets are disjoint, the conflict is resolved.
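A minimal sketch of how these two conflict patterns could be detected in a single state, assuming a toy Item/Symbol representation (not parsegen's actual types):

```rust
// Toy representation of LR(0) items; illustrative only, not parsegen's types.
#[derive(Clone, Copy)]
enum Symbol {
    Terminal(char),
    NonTerminal(&'static str),
}

struct Item {
    lhs: &'static str, // T1 in [T1 -> ... | ...]
    rhs: Vec<Symbol>,
    dot: usize, // position of `|` in the right-hand side
}

impl Item {
    fn is_reduce(&self) -> bool {
        self.dot == self.rhs.len() // [T -> ... |]
    }
    fn shifts_terminal(&self) -> bool {
        matches!(self.rhs.get(self.dot), Some(Symbol::Terminal(_))) // [T -> ... | t ...]
    }
}

/// Returns (has shift/reduce conflict, has reduce/reduce conflict)
/// for one state, i.e. one set of LR(0) items.
fn conflicts(state: &[Item]) -> (bool, bool) {
    let reduces = state.iter().filter(|i| i.is_reduce()).count();
    let shifts = state.iter().any(|i| i.shifts_terminal());
    (reduces >= 1 && shifts, reduces >= 2)
}
```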

How do we generate these lookaheads for the reduce items that are involved in a shift/reduce or reduce/reduce conflict?

Let's consider how the lookaheads of LR(1) reduce items are generated. An item is created in one of two ways: (1) by moving the dot over a symbol when generating a successor state, or (2) by expanding a non-terminal after the dot when computing the closure of a state.

In LR(1), an item generated with step (1) will have the same lookahead as the "parent" item. (code)

In step (2), the lookahead of the new item will be the "first" set of the parent item's continuation after the expanded non-terminal. Note that when computing this "first" set, we add the parent item's lookahead to the end of the item's continuation. E.g. in [T1 -> ... | <rest>, x] the "continuation" is <rest> x. (code)
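As a sketch, step (2)'s lookahead computation could look like this, assuming FIRST sets and nullability are precomputed (all names and types here are illustrative, not parsegen's):

```rust
use std::collections::{BTreeSet, HashMap};

#[derive(Clone, Copy)]
enum Symbol {
    T(char),          // terminal
    NT(&'static str), // non-terminal
}

// Lookaheads of a closure-generated item: the "first" set of the parent
// item's continuation, with the parent's lookahead appended at the end.
fn closure_lookaheads(
    continuation: &[Symbol],                       // <rest> in [T1 -> ... | <rest>, x]
    parent_lookahead: char,                        // x
    first: &HashMap<&'static str, BTreeSet<char>>, // precomputed FIRST sets
    nullable: &HashMap<&'static str, bool>,        // can derive the empty string?
) -> BTreeSet<char> {
    let mut result = BTreeSet::new();
    for sym in continuation {
        match *sym {
            Symbol::T(t) => {
                result.insert(t);
                return result; // a terminal is never nullable, so we stop here
            }
            Symbol::NT(nt) => {
                result.extend(&first[nt]);
                if !nullable[nt] {
                    return result;
                }
            }
        }
    }
    // The whole continuation can derive the empty string, so the parent's
    // lookahead x becomes visible.
    result.insert(parent_lookahead);
    result
}
```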

So for each item, we know its dependencies: the item(s) it was generated from.

It's easy to record the parent item of an item when generating it. (TODO: There should be cycles in some cases, find an example?)

With this dependency graph, when we want to know the lookahead of a reduction item (i.e. when it's involved in a shift/reduce or reduce/reduce conflict), we go backwards through the graph and try to obtain its lookahead.

I think we might want to add lookaheads to items as we visit them during this recursive process, as the lookahead of an item may be needed to resolve multiple conflicts. (TODO: Find an example?)

Note that this recursive process will return a set of lookaheads, not just a single lookahead. Also, because the dependency graph may have cycles, we need to keep track of visited items to avoid looping. The lookahead set we accumulate starts out as {}, and at the end of the process it should be non-empty.
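A sketch of this backward walk, assuming every item records the items it was generated from and the lookaheads it contributes directly (e.g. from the "first" computation above); the representation is hypothetical:

```rust
use std::collections::{BTreeSet, HashSet};

type ItemId = usize;

struct ItemNode {
    parents: Vec<ItemId>, // the items this item was generated from
    own: BTreeSet<char>,  // lookaheads contributed directly (step (2))
}

// Collect the lookahead set of `item` by walking the dependency graph
// backwards. `acc` starts out as {} and should be non-empty at the end.
fn collect_lookaheads(
    item: ItemId,
    graph: &[ItemNode],
    visited: &mut HashSet<ItemId>,
    acc: &mut BTreeSet<char>,
) {
    if !visited.insert(item) {
        return; // already visited: the graph may have cycles
    }
    acc.extend(&graph[item].own);
    for &parent in &graph[item].parents {
        collect_lookaheads(parent, graph, visited, acc);
    }
}
```

Caching the result per item (as suggested above) would avoid redoing this walk when the same item's lookaheads are needed for multiple conflicts.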

That's it. I think this is probably the same as phase 1 of the lane table method. Since the number of states will be the same as the number of LR(0) sets of the grammar, this algorithm should give us LALR(1). Phase 2 of the lane table method is about splitting some of the states/items to recover LR(1).

One of the things that I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).

osa1 commented 1 year ago

http://smallcultfollowing.com/babysteps/blog/2017/03/17/the-lane-table-algorithm/ describes the lane table algorithm of the paper referenced above, and from a quick look, it seems like phase 1 (the LALR(1) generator) is basically the same as the algorithm I describe above.

osa1 commented 1 year ago

I think to understand phase 2 of the lane table algorithm it may be helpful to look at a grammar that is LR(1) but not LALR(1), so that it cannot be handled by phase 1 without phase 2.

The blog post referenced above has this grammar, which is LR(1) and not LALR(1) (I think it's the grammar in example 3 of the paper):

G1 = "a" X "d"
   | "a" Y "c"
   | "b" X "c"
   | "b" Y "d"
X  = "e" X
   | "e"
Y  = "e" Y
   | "e"

If I feed this into parsegen this is the LR(0) automaton for it:

0: {
  [Entry1 -> |a X d]
  [Entry1 -> |a Y c]
  [Entry1 -> |b Y d]
  [Entry1 -> |b X c]
  [Entry -> |Entry1]
  Entry1 -> 1
  a -> 2
  b -> 3
}
1: {
  [Entry -> Entry1|]
}
2: {
  [Entry1 -> a |X d]
  [Entry1 -> a |Y c]
  [X -> |e X]
  [X -> |e]
  [Y -> |e Y]
  [Y -> |e]
  Y -> 12
  e -> 4
  X -> 11
}
3: {
  [Entry1 -> b |Y d]
  [Entry1 -> b |X c]
  [X -> |e X]
  [X -> |e]
  [Y -> |e Y]
  [Y -> |e]
  Y -> 6
  e -> 4
  X -> 5
}
4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|]
  X -> 10
  e -> 4
  Y -> 9
}
5: {
  [Entry1 -> b X |c]
  c -> 8
}
6: {
  [Entry1 -> b Y |d]
  d -> 7
}
7: {
  [Entry1 -> b Y d|]
}
8: {
  [Entry1 -> b X c|]
}
9: {
  [Y -> e Y|]
}
10: {
  [X -> e X|]
}
11: {
  [Entry1 -> a X |d]
  d -> 14
}
12: {
  [Entry1 -> a Y |c]
  c -> 13
}
13: {
  [Entry1 -> a Y c|]
}
14: {
  [Entry1 -> a X d|]
}

State 4 has shift/reduce and reduce/reduce conflicts, so we want to know the lookaheads of the reducible items in state 4, as those will potentially resolve the conflicts.

Let's consider one of the items in state 4: [Y -> e|].

This item is generated by a shift of e, so we want to know which item(s) in which state(s) generated this shift. We have two (state, item) pairs for it: the closure item [Y -> |e] in state 2, generated from [Entry1 -> a |Y c], and the closure item [Y -> |e] in state 3, generated from [Entry1 -> b |Y d].

The first item generates lookahead c (the "first" set of the continuation after Y). The second item generates lookahead d.

Similarly we find the lookaheads for the other reducible item in state 4, which is [X -> e|], and the lookaheads are again c and d.

So state 4 with lookaheads is:

4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|, {c, d}]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|, {c, d}]
  X -> 10
  e -> 4
  Y -> 9
}

Shift/reduce conflicts are resolved, but we still have a reduce/reduce conflict. So our algorithm (which I think should be LALR(1)) cannot handle this grammar.

Now let's look at the LR(1) automaton for this grammar to see how this case is handled:

0: {
  [Entry1 -> |a X d, $]
  [Entry1 -> |a Y c, $]
  [Entry1 -> |b Y d, $]
  [Entry1 -> |b X c, $]
  [Entry -> |Entry1, $]
  GOTO Entry1 -> 1
  GOTO a -> 2
  GOTO b -> 3
}
1: {
  [Entry -> Entry1|, $]
}
2: {
  [Entry1 -> a |X d, $]
  [Entry1 -> a |Y c, $]
  [X -> |e X, d]
  [X -> |e, d]
  [Y -> |e Y, c]
  [Y -> |e, c]
  GOTO e -> 6
  GOTO X -> 4
  GOTO Y -> 5
}
3: {
  [Entry1 -> b |Y d, $]
  [Entry1 -> b |X c, $]
  [X -> |e X, c]
  [X -> |e, c]
  [Y -> |e Y, d]
  [Y -> |e, d]
  GOTO e -> 9
  GOTO X -> 7
  GOTO Y -> 8
}
4: {
  [Entry1 -> a X |d, $]
  GOTO d -> 10
}
5: {
  [Entry1 -> a Y |c, $]
  GOTO c -> 11
}
6: {
  [X -> |e X, d]
  [X -> e |X, d]
  [X -> |e, d]
  [X -> e|, d]
  [Y -> |e Y, c]
  [Y -> e |Y, c]
  [Y -> |e, c]
  [Y -> e|, c]
  GOTO e -> 6
  GOTO X -> 12
  GOTO Y -> 13
}
7: {
  [Entry1 -> b X |c, $]
  GOTO c -> 14
}
8: {
  [Entry1 -> b Y |d, $]
  GOTO d -> 15
}
9: {
  [X -> |e X, c]
  [X -> e |X, c]
  [X -> |e, c]
  [X -> e|, c]
  [Y -> |e Y, d]
  [Y -> e |Y, d]
  [Y -> |e, d]
  [Y -> e|, d]
  GOTO e -> 9
  GOTO X -> 16
  GOTO Y -> 17
}
10: {
  [Entry1 -> a X d|, $]
}
11: {
  [Entry1 -> a Y c|, $]
}
12: {
  [X -> e X|, d]
}
13: {
  [Y -> e Y|, c]
}
14: {
  [Entry1 -> b X c|, $]
}
15: {
  [Entry1 -> b Y d|, $]
}
16: {
  [X -> e X|, c]
}
17: {
  [Y -> e Y|, d]
}

(To be continued...)

osa1 commented 1 year ago

To see the difference, let's consider an input that will cause the LALR(1) version to get stuck: "aed".

The LALR(1) states will go like this:

states: [0]
symbols: []

==> a

states: [0, 2]
symbols: [a]

==> e

states: [0, 2, 4]
symbols: [a, e]

At this point we're stuck, because state 4 has two possible reductions when the next symbol is "d".

Now let's see how the LR(1) version handles it:

states: [0]
symbols: []

==> a

states: [0, 2]
symbols: [a]

==> e

states: [0, 2, 6]
symbols: [a, e]

==> d

states: [0, 2, 4, 10]
symbols: [a, X, d]

So after the prefix "ae" we get stuck in the LALR(1) version, but we can make progress in the LR(1) version.

The problem seems to be that the LALR(1) state after the prefix "ae"

4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|, {c, d}]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|, {c, d}]
}

is the "union" of these two states in the LR(1) version:

6: {
  [X -> |e X, d]
  [X -> e |X, d]
  [X -> |e, d]
  [X -> e|, d]
  [Y -> |e Y, c]
  [Y -> e |Y, c]
  [Y -> |e, c]
  [Y -> e|, c]
}

9: {
  [X -> |e X, c]
  [X -> e |X, c]
  [X -> |e, c]
  [X -> e|, c]
  [Y -> |e Y, d]
  [Y -> e |Y, d]
  [Y -> |e, d]
  [Y -> e|, d]
}

Note that these two states are identical if we remove the lookaheads.

In the LR(1) version, keeping track of lookaheads even in shiftable items, and considering items different when their lookaheads differ, gives us more precise tracking of context (i.e. how we ended up in a state), which allows us to avoid some of the conflicts that the LALR(1) version has. In the LALR(1) version, the inputs "ae" and "be" lead us to the same state (state 4). In the LR(1) version, if we see "a", we keep track of the fact that we can only reduce to X when the lookahead is d and to Y when the lookahead is c, and similarly when we start with "b".

So if we want to turn our LALR(1) parser into an LR(1) one, we need to somehow split some of the states into multiple states to allow maintaining lookaheads based on how we end up in a state.

This is what the second phase of the lane table algorithm does, but I still don't understand how it does this.

osa1 commented 1 year ago

I think I started to understand phase 2 of the lane table algorithm.

Let's consider this state in our LALR(1) states:

4: {
  [X -> |e X]
  [X -> e |X]
  [X -> |e]
  [X -> e|, {c, d}]
  [Y -> |e Y]
  [Y -> e |Y]
  [Y -> |e]
  [Y -> e|, {c, d}]
}

We want to resolve the reduce/reduce conflict in this state. I think the idea is this: in the algorithm described above, to generate this LALR(1) state we visit predecessors of the reduce items and try to find the lookahead symbols. As we do that, we build a table that records which state contributes which lookaheads to which reduce actions. We first give names to the reduce items:

A1 = [X -> e|]
A2 = [Y -> e|]

We have two paths from the initial state to state 4: 0 -> 2 -> 4 (shifting a, then e) and 0 -> 3 -> 4 (shifting b, then e).

So our table looks like this:

States/Actions    A1        A2
2                 d         c
3                 c         d
4                 {c, d}    {c, d}

This says that state 3 adds lookahead c to A1 ([X -> e|]) and lookahead d to A2 ([Y -> e|]), and so on.

State 0 is omitted (I think) because it doesn't contribute any lookaheads to A1 or A2. (we don't even visit state 0 in our LALR(1) algorithm)
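As a data structure, the lane table could simply be a map from (state, action) to the contributed lookahead set. A sketch, populated with the table above (names illustrative):

```rust
use std::collections::{BTreeSet, HashMap};

type StateId = usize;
type ActionId = usize; // 1 for A1 = [X -> e|], 2 for A2 = [Y -> e|]

type LaneTable = HashMap<(StateId, ActionId), BTreeSet<char>>;

fn example_table() -> LaneTable {
    let mut t = LaneTable::new();
    t.insert((2, 1), BTreeSet::from(['d']));      // state 2 contributes d to A1
    t.insert((2, 2), BTreeSet::from(['c']));      // ... and c to A2
    t.insert((3, 1), BTreeSet::from(['c']));
    t.insert((3, 2), BTreeSet::from(['d']));
    t.insert((4, 1), BTreeSet::from(['c', 'd'])); // state 4 sees the union
    t.insert((4, 2), BTreeSet::from(['c', 'd']));
    t
}
```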

Now the idea is that we want to introduce some new states so that we'll have states similar to our LR(1) states, which allowed better tracking of lookaheads and avoided the reduce/reduce conflict we have in the LALR(1) states.

For that, we start with "lanehead" states and do a depth-first traversal. (TODO: Would breadth-first work? Would it be less efficient? I think depth-first is used just because it's easier to implement with recursive calls.)

A lanehead state is a state in the table that is not reached from any other state in the table. In the table above, these are states 2 and 3.

(There will always be a lanehead state: in the worst case it will be the initial state with the item [S0 -> | S]. Since no other state can have the item [S0 -> | S] (because S0 is the special initial non-terminal), there won't be any incoming edges to the initial state, so it won't have any predecessors.)
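Under this definition, lanehead detection is just a predecessor check. A sketch, assuming the predecessor relation of the LR(0) automaton is available (illustrative):

```rust
use std::collections::{HashMap, HashSet};

type StateId = usize;

// A lanehead is a state that appears in the lane table but is not
// reached from any other state in the table.
fn laneheads(
    table_states: &HashSet<StateId>,
    preds: &HashMap<StateId, Vec<StateId>>, // predecessors in the automaton
) -> Vec<StateId> {
    table_states
        .iter()
        .copied()
        .filter(|s| match preds.get(s) {
            None => true,
            Some(ps) => ps.iter().all(|p| !table_states.contains(p)),
        })
        .collect()
}
```

For the table above this returns states 2 and 3: state 4's predecessors (2, 3, and 4 itself) are all in the table, while 2 and 3 are only reached from state 0, which isn't.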

Starting with the lanehead states, we do a depth-first traversal and collect the lookaheads. We end up with two lanes: [2, 4] with lookaheads {A1: {d}, A2: {c}}, and [3, 4] with lookaheads {A1: {c}, A2: {d}}.

When we look at the lookaheads of actions in each of these lanes, we see that none of them have reduce/reduce conflicts.

What this tells us is that if we had two lanes for [0, 2, 4] and [0, 3, 4] we wouldn't have the reduce/reduce conflicts.

Now, in the example above it's clear that we want to split state 4, say into A and B, and have two lanes [0, 2, A] and [0, 3, B].

But it's not clear to me how to split these lanes in general. I'm hoping to figure this out next.

osa1 commented 1 year ago

I think splitting states probably works like this.

Starting with the lanehead states we do a depth-first traversal again, and for each state, check if we can merge its context with its successor's.

The traversal will visit the lanes in this order: first [2, 4], then [3, 4].

In the first case, we can merge the contexts of 2 and 4 without any conflicts, so we do it and get lookaheads {A1: {d}, A2: {c}}.

Now we consider states 2 and 4 as having the same context, which is as shown above.

When we check [3, 4] we try to merge 3 and 4. Since 4 now has context {A1: {d}, A2: {c}}, merging the context of 3 into 4 causes a conflict, so we cannot merge the contexts of 3 and 4. Instead we duplicate state 4; let's call the copy 4_2.

So now the second lane is [3, 4_2], and we can merge these two to get context {A1: {c}, A2: {d}}, which again doesn't have any conflicts.
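A sketch of this merge-or-split check, under my reading of phase 2: a lane's context maps each reduce action to its lookahead set, and merging fails when two actions' sets would overlap (names illustrative):

```rust
use std::collections::{BTreeSet, HashMap};

type ActionId = usize;
type Context = HashMap<ActionId, BTreeSet<char>>;

// Try to merge context `b` into context `a`. Returns None if the merged
// context would have a reduce/reduce conflict, i.e. two different
// actions sharing a lookahead; in that case the caller duplicates the
// state (like 4 -> 4_2 above) and continues the lane with the copy.
fn try_merge(a: &Context, b: &Context) -> Option<Context> {
    let mut merged = a.clone();
    for (action, lookaheads) in b {
        merged.entry(*action).or_default().extend(lookaheads);
    }
    let sets: Vec<&BTreeSet<char>> = merged.values().collect();
    for i in 0..sets.len() {
        for j in (i + 1)..sets.len() {
            if !sets[i].is_disjoint(sets[j]) {
                return None;
            }
        }
    }
    Some(merged)
}
```

With the contexts above, merging {A1: {c}, A2: {d}} into {A1: {d}, A2: {c}} would yield {A1: {c, d}, A2: {c, d}}, whose sets overlap, so state 4 has to be duplicated.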

osa1 commented 1 year ago

Here are the LR(0) and LR(1) automaton graphs for our LR(1) grammar example:

[figure: lr0 automaton graph]

[figure: lr1 automaton graph]

modulovalue commented 1 year ago

> because the dependency graph may have cycles, we need to keep track of visited items to avoid looping.

Have you considered looking at the SCCs of the dependency graph instead of only considering cycles? As I said on Reddit, I think the relations-based algorithm is the simplest one, and I think it is exactly what you are looking for.
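For reference, a sketch of the SCC idea: collapse the item dependency graph into strongly connected components, then compute each component's lookahead set in a single pass. This assumes the graph is stored in a petgraph DiGraph whose node weights are the lookaheads an item contributes directly; it's illustrative, not parsegen code:

```rust
use std::collections::BTreeSet;

use petgraph::algo::tarjan_scc;
use petgraph::graph::DiGraph;

// Nodes are items, node weights are the lookaheads an item contributes
// directly, and an edge a -> b means "item a depends on item b".
fn scc_lookaheads(g: &DiGraph<BTreeSet<char>, ()>) -> Vec<BTreeSet<char>> {
    // tarjan_scc returns components in reverse topological order, so a
    // component's dependencies are always processed before it.
    let sccs = tarjan_scc(g);
    let mut comp_of = vec![0usize; g.node_count()];
    for (c, comp) in sccs.iter().enumerate() {
        for &n in comp {
            comp_of[n.index()] = c;
        }
    }
    let mut result: Vec<BTreeSet<char>> = vec![BTreeSet::new(); sccs.len()];
    for (c, comp) in sccs.iter().enumerate() {
        for &n in comp {
            result[c].extend(&g[n]); // the item's own lookaheads
            for m in g.neighbors(n) {
                let d = comp_of[m.index()];
                if d != c {
                    // Cross-component dependency: already complete.
                    let dep = result[d].clone();
                    result[c].extend(dep);
                }
            }
        }
    }
    result // lookahead set per component (shared by all items in it)
}
```

All items in a cycle (an SCC) necessarily share the same lookahead set, which is what the visited-set trick above computes implicitly.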

modulovalue commented 1 year ago

> One of the things that I don't understand about the lane table method is that for LR(1) grammars it somehow generates fewer states than Knuth's LR(1) algorithm (which is probably the one we have in parsegen). That suggests that, starting with Knuth's algorithm, we may be able to reduce the number of states without losing power. I don't remember seeing this mentioned before; I thought LR(1) is LR(1) -- the sets are minimal (and huge).

Do you know if the redundant states that I've highlighted in https://github.com/osa1/parsegen/issues/11 are also redundant when the lane table method is used for constructing LR(1)? Or are they already merged there?