Closed ltratt closed 6 years ago
Rest looks good. Ready for merging once squashed.
OK, I'm going to see if I can undo the damage I did in the last rebase...
OK, I think I've moved everything to the right place now. [Thank goodness for the "edit" option in git rebase!] If you agree, please feel free to merge.
Merged!
The first 4 commits mostly just move code around semi-mechanically: the final commit is where the action lies. It is quite big, unfortunately, but it's really difficult to split it into smaller bits.
After playing with this for weeks, I think I've understood the paper well enough to derive a fairly faithful implementation (bearing in mind that the paper is somewhat incomplete, and frequently confusing). As the paper suggests, this only finds a single minimal cost solution to each parser error, so it's highly non-deterministic: sometimes the minimal cost solution it comes up with is good, and sometimes it isn't. It is, however, pretty fast, certainly compared to the Corchuelo et al. algorithm. The biggest extension relative to the paper is that they really only care about finding a repair: they don't seem to care about reporting that to the user. The repairs thus reference non-terminals in the grammar which, to most users, are completely incomprehensible. This commit uses cfgrammar's SentenceGenerator to turn a non-terminal repair into a (possible set) of terminal inserts, which is infinitely easier to understand. Thus whilst KimYi might say a repair is:
this commit will say something like:
There are several glaring inefficiencies in this commit: notably, we continually make new SentenceGenerators. That's because, currently, we don't really have a good way for passing state between, and within, error recovery instances. However, that's a more invasive change which can come later.
To test this on a real Java file use something like:
which will print the parse tree and then repairs e.g.:
The big difference relative to the Corchuelo et al. algorithm is that you can end up with "choose from a set". e.g. the second repair says "choose any of {integer, float, ..., identifier}".
A future PR will implement what I'm tentatively calling the KimYi+ algorithm, which is basically their algorithm plus various ideas I've collected to make it more useful to humans. That might slightly depend on how https://github.com/samueltardieu/pathfinding/pull/44 ends up going.