softdevteam / lrpar

Rust LR parser
Other
1 stars 0 forks source link

MF: a new recovery algorithm #40

Closed ltratt closed 6 years ago

ltratt commented 6 years ago

This PR looks scary, but it's not quite as bad as it first looks if you take the commits one-by-one. The first few commits are boiler-plate stuff. https://github.com/softdevteam/lrpar/commit/ba35e6aed3ae8fc8be8d9d94901a03aff600ae26 is the main commit, giving us a completely new recovery algorithm. https://github.com/softdevteam/lrpar/pull/40/commits/03de1fa669e0dba4841c55b8101f23ad612ea807 is a relatively stand-alone performance change. The other latter commits are relatively simple changes. The important commits are extensively documented (hopefully) in both the commit messages and (where useful) in the code.

The problem this PR tries to solve is that the KimYi recoverer (either in KimYi or KimYiPlus modes) is flawed. Here's a simple grammar:

S: T U 'C';
T: 'A';
U: 'B';

Given the input "c", KimYi can't find any repairs, even though it looks like it should.

This PR introduces a new recoverer "MF" (the name is mostly meaningless), which is heavily inspired by KimYi. For the above example, it finds the repair:

Insert 'A', Insert 'B'

Sometimes KimYi's flaw means that it finds a longer repair than it should. Consider this grammar:

S:  'A' 'B' 'D' | 'A' 'B' 'C' 'A' 'A' 'D';

Give the input "acd", KimYi finds the following repair:

  Insert "B", Shift, Insert "A", Insert "A"

Whereas MF finds the following shorter repair:

  Insert "B", Delete

To make reviewing easier, MF reuses as much of KimYiPlus's code as possible, which means you can generally flick between mf.rs and kimyi_plus.rs to see what the differences are.

To cut a long story short, MF's major change is to rethink the dist() function. From a state S, KimYi tells us how many symbols need to be inserted to reach a terminal T. It does this by following edges with terminals in the stategraph. This fails to take into accout reductions and gotos, which means that KimYi::dist() often overestimates the true distance to T or incorrectly claims that T is unreachable. This then causes the A algorithm to generate incorrect results (A works fine if dist() underestimates the distance, but doesn't work properly if dist() overestimates).

MF's dist() algorithm takes into account reductions and gotos. This means that MF's dist is more conservative / pessimistic, but it means that it never overestimates the distance. The current MF::dist is pretty slow (taking several times as long as KimYi::dist), although that could be optimised a fair bit in the future. The changes to dist() then have a few knock-on effects (mostly simplifications) to the rules discovering repairs.

The good news is that, overall, performance is almost identical between MF and KimYiPlus (mostly because the new A* algorithm in this PR wins back what we lose from having a more conservative dist() function). There is definitely a need for more optimisation, but at least we now have an approach which is less obviously flawed.

ptersilie commented 6 years ago

One quick comments for the PR text (I'll review the code later). Maybe a bit nit-picky. But it's a bit confusing that terminals in your example grammars are capitalised while the input is lower case. So I'm assuming there's a hidden lexer somewhere with rules like A: a, B: b, etc. because otherwise none of your input would parse at all. ;-)

ltratt commented 6 years ago

Point taken. Would you like me to fix that before you start reviewing?

ptersilie commented 6 years ago

Nah, that's alright. I managed to decipher it. Just for future PRs I think we should agree on capitalised nonterminals and lowercase terminals as is the standard I think.

ltratt commented 6 years ago

Agreed. I'll add a commit which does that and squash it in at the end if you're OK with that.

ltratt commented 6 years ago

I looked at the uppercase/lowercase thing in more detail, got half way through making the necessary changes, and then realised there's more to it than I first thought. Our convention is that terminal types are capitalised e.g. [0-9]+ INT. So in that sense it makes sense that one has a A as a lexer rule. Or, in other words, there's a good argument that the behaviour in the PR matches our convention already...

Any thoughts?

ptersilie commented 6 years ago

The thing is you didn't provide any lexing rules with it. Which made me assume that the terminals don't have any. If a terminal used in the grammar doesn't have a lexing rule, Yacc automatically creates one that matches exactly that terminal, e.g. a grammar like

S := "a" "b"

without any lexing rules, will still happily parse "ab" as input.

ptersilie commented 6 years ago

I'm having trouble to read the commit message in https://github.com/softdevteam/lrpar/pull/40/commits/ba35e6aed3ae8fc8be8d9d94901a03aff600ae26

What is this supposed to mean: q_i \xrightarrow[X] \wedge dist(q, a_p) < \infty? I wonder if there is a better way to render this properly without having to copy this into a latex document?

ptersilie commented 6 years ago

Only some high-level comments. Rest looks fine.

ltratt commented 6 years ago

I wonder if there is a better way to render this properly without having to copy this into a latex document?

I don't think so :/ Unfortunately it's not uncommon to see people use this in emails and so on. It's not ideal, but when plain text is the interchange mechanism, I don't know anything better I'm afraid.

ltratt commented 6 years ago

I just noticed a definite "oops" which https://github.com/softdevteam/lrpar/pull/40/commits/a260c9067156f97ef941111515f06d722068f872 fixes. The new dist() algorithm incorporates KimYi's dist algorithm, but I'd needlessly left two copies of it in the code! Of course, this was semantically correct, but not exactly very useful. The commit simply removes what is effectively dead (or, more accurately, pointless) code. Sorry for not spotting this when I raised the PR!

ptersilie commented 6 years ago

Okay, I'm happy with your replies. One thing we haven't agreed on yet is the terminals in your examples (see https://github.com/softdevteam/lrpar/pull/40#issuecomment-358980389). I still believe that if you don't provide lexing rules, the terminals should be lower case so that your examples still work (even without lexing rules).

ltratt commented 6 years ago

Right, so I think I should squash and a) address a couple of the typos you found b) update the commit messages with lower-case token names. Sound like a plan?

ptersilie commented 6 years ago

Sounds good to me. :)

ltratt commented 6 years ago

OK, squashed and (hopefully) comments fixed!

ptersilie commented 6 years ago

Think you forgot this one put 25% seems not uncommon.

ltratt commented 6 years ago

Oops -- I forgot the other typo in that commit message too. Mea culpa! Now fixed (hopefully).

ptersilie commented 6 years ago

Good. Will merge once tests are complete.