softdevteam / lrpar

Rust LR parser
Other
1 stars 0 forks source link

Implement the KimYi algorithm #32

Closed ltratt closed 6 years ago

ltratt commented 6 years ago

The first 4 commits mostly just move code around semi-mechanically: the final commit is where the action lies. It is quite big, unfortunately, but it's really difficult to split it into smaller bits.

After playing with this for weeks, I think I've understood the paper well enough to derive a fairly faithful implementation (bearing in mind that the paper is somewhat incomplete, and frequently confusing). As the paper suggests, this only finds a single minimal cost solution to each parser error, so it's highly non-deterministic: sometimes the minimal cost solution it comes up with is good, and sometimes it isn't. It is, however, pretty fast, certainly compared to the Corchuelo et al. algorithm. The biggest extension relative to the paper is that they really only care about finding a repair: they don't seem to care about reporting that to the user. The repairs thus reference non-terminals in the grammar which, to most users, are completely incomprehensible. This commit uses cfgrammar's SentenceGenerator to turn a non-terminal repair into a (possible set) of terminal inserts, which is infinitely easier to understand. Thus whilst KimYi might say a repair is:

  InsertNonterm expr

this commit will say something like:

  InsertTerm {Var, Identifier}, InsertTerm{+, -, *}, InsertTerm {Var, Identifier}

There are several glaring inefficiencies in this commit: notably, we continually make new SentenceGenerators. That's because, currently, we don't really have a good way for passing state between, and within, error recovery instances. However, that's a more invasive change which can come later.

To test this on a real Java file use something like:

$ cargo run --release java.l java.y <java file with errors in>

which will print the parse tree and then repairs e.g.:

Error detected at line 56 col 13. Amongst the valid repairs are:
  Insert "EQ"
Error detected at line 56 col 62. Amongst the valid repairs are:
  Insert {INTEGER_LITERAL, FLOATING_POINT_LITERAL, BOOLEAN_LITERAL, CHARACTER_LITERAL, STRING_LITERAL, NULL_LITERAL, THIS, IDENTIFIER}, Keep ")", Insert "SEMICOLON"
Error detected at line 1137 col 41. Amongst the valid repairs are:
  Insert "AND", Keep "to", Insert "QUESTION"

The big difference relative to the Corchuelo et al. algorithm is that you can end up with "choose from a set". e.g. the second repair says "choose any of {integer, float, ..., identifier}".

A future PR will implement what I'm tentatively calling the KimYi+ algorithm, which is basically their algorithm plus various ideas I've collected to make it more useful to humans. That might slightly depend on how https://github.com/samueltardieu/pathfinding/pull/44 ends up going.

ptersilie commented 6 years ago

Rest looks good. Ready for merging once squashed.

ltratt commented 6 years ago

OK, I'm going to see if I can undo the damage I did in the last rebase...

ltratt commented 6 years ago

OK, I think I've moved everything to the right place now. [Thank goodness for the "edit" option in git rebase!] If you agree, please feel free to merge.

ptersilie commented 6 years ago

Merged!