sunjay / dino

Compiler / PL Experimentation
Mozilla Public License 2.0

Error Recovery in Parser #57

Open sunjay opened 4 years ago

sunjay commented 4 years ago

The basic idea is that the parser accepts any string. If the string does not match the grammar, the parser corrects it to match the grammar. Currently, there are two types of corrections:

  1. Skip some number of tokens or subtrees
  2. Insert one missing token

Obviously, for a given syntax error, there are multiple combinations of corrections that would work. The parser handles this decision similarly to how ambiguities are handled by the classic GLR algorithm: the parse stack forks into multiple branches, and tries a different possibility on each branch.
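To make the forking step concrete, here is a minimal sketch in Rust. Every type and cost weight here is invented for illustration; error_cost and node_count are the two quantities described just below:

```rust
// Hypothetical types, for illustration only.
#[derive(Clone, Copy)]
struct TokenKind; // stand-in for the real token type

#[derive(Clone)]
struct Branch {
    position: usize, // how far into the token stream this branch is
    error_cost: u32, // accumulated cost of all corrections so far
    node_count: u32, // valid nodes added since the last error
}

const SKIP_COST: u32 = 10; // made-up weights
const INSERT_COST: u32 = 15;

// On a syntax error, fork the branch once per possible correction:
// skip the offending token, or insert one of the viable missing tokens.
fn fork_on_error(branch: &Branch, insertable: &[TokenKind]) -> Vec<Branch> {
    let mut forks = Vec::new();

    // Correction 1: skip the current token, paying an error cost.
    let mut skip = branch.clone();
    skip.position += 1;
    skip.error_cost += SKIP_COST;
    forks.push(skip);

    // Correction 2: insert a missing token (the position does not
    // advance, but the parse state would; elided here).
    for _kind in insertable {
        let mut insert = branch.clone();
        insert.error_cost += INSERT_COST;
        forks.push(insert);
    }

    forks
}
```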

We maintain two quantities associated with each branch of the parse stack:

  • error_cost - an integer that combines the number of subtrees skipped, the total size of those subtrees, and the number of missing tokens inserted.
  • node_count - the number of valid syntax tree nodes that have been added since the last error.

There is some heuristic logic for deciding when to discard branches of the parse stack based on their error cost and their node count. When the node count is small, it means that we've only recently encountered an error, so we want to allow a fairly large number of branches to "compete", to see which one turns out the best. When the node count is large, it means we have progressed far past the error, so we should keep very few branches (eventually just one) in order to restore good parsing performance.

How do you decide which branches to prune when the node count gets bigger?

Basically, we keep branches with a lower error cost, and discard branches with a higher error cost. And the threshold for discarding branches becomes tighter and tighter as the node count increases, so after we parse for a little while, we end up with only one branch again.
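A minimal sketch of that tightening threshold, reusing the hypothetical Branch from the forking sketch above. The budget formula is invented; only the shape of the idea comes from the discussion:

```rust
// Keep a branch only if its error cost is within a budget that shrinks
// as node_count grows; once the budget bottoms out, only branches tied
// with the cheapest one survive.
fn prune(mut branches: Vec<Branch>) -> Vec<Branch> {
    if branches.is_empty() {
        return branches;
    }
    branches.sort_by_key(|b| b.error_cost);
    let best = branches[0].error_cost;
    branches.retain(|b| {
        // Hypothetical budget: generous right after an error
        // (node_count near 0), shrinking to zero as we parse past it.
        let budget = 30u32.saturating_sub(b.node_count);
        b.error_cost <= best + budget
    });
    branches
}
```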

I'd guess additional errors on an error-recovery branch usually mean it's a dead end, no?

Yeah, this ends up falling out of our strategy, because additional errors add to a branch's error cost.

sunjay commented 4 years ago

After #58, we added parser combinators and implemented a simple MVP parser using them. An issue that came up is that it's really difficult to tell where to produce an error and where to discard one. Parser combinators, as implemented today, just don't encode enough information to make that decision.

For example, consider the many0 and opt parsers. When many0's inner parser fails, many0 can't tell whether the repetition has simply ended or a genuine syntax error has occurred, so it has to discard the error either way.
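Here is roughly what that closure-style many0 looks like, with hypothetical Input and ParseError types; note how the final Err is swallowed unconditionally:

```rust
// Hypothetical input and error types, just for illustration.
type Input<'a> = &'a [char];

#[derive(Debug)]
struct ParseError;

// Closure-style many0: apply `parser` repeatedly until it fails.
// The problem: the final failure is *always* treated as "the repetition
// has ended", even when it is a genuine syntax error partway through an
// item, so the error information is silently discarded either way.
fn many0<'a, T>(
    parser: impl Fn(&mut Input<'a>) -> Result<T, ParseError>,
) -> impl Fn(&mut Input<'a>) -> Result<Vec<T>, ParseError> {
    move |input| {
        let mut items = Vec::new();
        loop {
            let checkpoint = *input;
            match parser(input) {
                Ok(item) => items.push(item),
                Err(_) => {
                    *input = checkpoint; // rewind and stop
                    return Ok(items);
                }
            }
        }
    }
}
```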

We had to hand-roll a few parsers because of the limitations around error handling. Specifically, block and cond both need at least two tokens of lookahead to figure out whether they should continue or not. block actually needs some very specialized handling because it can't tell whether it has a statement or the final expression until the end of an expression (which might be any number of tokens away). Parser generators that build parse tables have no such issue: they can just parse the expression, then check for a semicolon to determine whether it was a statement or an expression. The classification is left until the end, not encoded in the control flow at the start. For cond, the situation is simpler because we just need two tokens of lookahead: if we see else if, we parse the condition and block; if we see else {, we just parse the block.
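A sketch of what that hand-rolled two-token lookahead for cond could look like; the Token variants and the peek method are hypothetical stand-ins for whatever the real token stream exposes:

```rust
// Hypothetical token type and stream, for illustration only.
#[derive(Clone, Copy, PartialEq)]
enum Token {
    Else,
    If,
    OpenBrace,
    // ...
}

struct TokenStream {
    tokens: Vec<Token>,
    pos: usize,
}

impl TokenStream {
    // Look at the token `n` positions ahead without consuming it.
    fn peek(&self, n: usize) -> Option<Token> {
        self.tokens.get(self.pos + n).copied()
    }
}

enum ElseKind {
    ElseIf,    // `else if`: parse another condition and block
    FinalElse, // `else {`: parse the final block and stop
    Done,      // anything else: the conditional is over
}

fn classify_else(tokens: &TokenStream) -> ElseKind {
    match (tokens.peek(0), tokens.peek(1)) {
        (Some(Token::Else), Some(Token::If)) => ElseKind::ElseIf,
        (Some(Token::Else), Some(Token::OpenBrace)) => ElseKind::FinalElse,
        _ => ElseKind::Done,
    }
}
```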

The opt combinator is fairly heavy-handed. It never returns an error: either it completely succeeds and returns Some(...), or it completely fails and returns None. This was necessary because of the same limitations that make many0 unusable in some cases. We can't encode the number of tokens that are allowed to fail (i.e. the "lookahead"), so we have to make opt work like this in order for it to be useful at all. If opt returned an error in the same situations that many0 does, the combinator would become unusable in a bunch of different situations.
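In the same hypothetical style as the many0 sketch above, opt ends up with this all-or-nothing shape: it rewinds on any failure, so no error can escape it:

```rust
// Same hypothetical Input/ParseError types as the many0 sketch above.
// Either the inner parser fully succeeds, or we rewind and return None;
// any error, however deep into the input it occurred, is discarded.
fn opt<'a, T>(
    parser: impl Fn(&mut Input<'a>) -> Result<T, ParseError>,
) -> impl Fn(&mut Input<'a>) -> Option<T> {
    move |input| {
        let checkpoint = *input;
        match parser(input) {
            Ok(value) => Some(value),
            Err(_) => {
                *input = checkpoint;
                None
            }
        }
    }
}
```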

The nom crate tries to solve this with the cut combinator, which wraps any parser and makes it produce a "fatal" error. nom's error type has two variants: Error and Failure. Error is a recoverable error, and Failure is an unrecoverable one. The idea is that when a parser sees Failure, it's supposed to always propagate it. This is how nom's many0 and opt combinators distinguish cases that are fine from cases that should produce an error.

I don't think the cut approach really works in general. For one thing, it requires you to manually annotate the cases that are fatal errors, which is error-prone and easy to mess up or forget. Another problem is that whether an error is "unrecoverable" is highly context-dependent. For example, when parsing if, you know that the condition and block are required, so should they be an unrecoverable error? No, because if they were, other combinators like many0 and alt would always return that error, even if another combinator could succeed in that place. If you really think about it, it's very hard to put cut somewhere where it's guaranteed to always work the way you expect.
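For concreteness, here is roughly what using cut looks like, written against nom 7's API with a made-up grammar:

```rust
use nom::{
    branch::alt,
    character::complete::{alpha1, char},
    combinator::cut,
    sequence::{preceded, terminated},
    IResult,
};

// A value is either a parenthesized identifier or a bare identifier.
// Once `(` has matched, the identifier and `)` are mandatory: `cut`
// upgrades their recoverable Err::Error into a fatal Err::Failure
// that `alt` will propagate instead of trying the next alternative.
fn value(input: &str) -> IResult<&str, &str> {
    alt((
        preceded(char('('), cut(terminated(alpha1, char(')')))),
        alpha1,
    ))(input)
}

fn main() {
    assert!(value("(abc)").is_ok());
    assert!(value("abc").is_ok());
    // Missing `)`: with cut this fails with the error from inside the
    // parentheses; without cut, alt would discard it and report only
    // that its last alternative didn't match either.
    assert!(value("(abc").is_err());
}
```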

This is one of the "fatal flaws" of parser combinators. I think they just don't encode enough information about what cases can be recovered from in which situations. In a way, I think it's because we're encoding the wrong information. It's the difference between imperative programming and declarative programming. With combinators, we're imperatively telling the computer to test various parses. With a more declarative style, we could encode the information we want to test, and do more clever things with look-ahead/recovery at runtime.

This point might be counter-intuitive. Parser combinators look very declarative! Indeed, I think the imperative nature of combinators has more to do with our implementation of them than with the way we write them per se. It might be interesting to experiment with versions of the combinators that return structs instead of functions. For example, the many0 combinator could return a Many0<P> struct that is generic over its inner parser. If we can turn the information encoded by the combinators into data, maybe we can process that data and essentially create a parser generator written in combinator style. With that additional data, we could potentially generate parse tables and do error recovery automatically.
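A very rough sketch of what a data-encoded many0 might look like; every name here is hypothetical, and the FIRST-set analysis is just one example of what becomes possible once the combinator is a value you can inspect:

```rust
// Hypothetical token and parser abstractions, for illustration only.
#[derive(Clone, Copy, PartialEq)]
enum Token {
    Ident,
    Comma,
    // ...
}

trait Parser {
    type Output;

    fn parse(&self, input: &mut &[Token]) -> Option<Self::Output>;

    /// Because the combinator is data we can inspect, we can also
    /// analyze it statically, e.g. compute the tokens it may start
    /// with (a FIRST set), the way a parser generator would.
    fn first_set(&self) -> Vec<Token>;
}

/// `many0` as data: a struct generic over its inner parser.
struct Many0<P>(P);

impl<P: Parser> Parser for Many0<P> {
    type Output = Vec<P::Output>;

    fn parse(&self, input: &mut &[Token]) -> Option<Self::Output> {
        let mut items = Vec::new();
        // Instead of "try and backtrack", use the precomputed FIRST
        // set to decide up front whether another repetition can start.
        while input
            .first()
            .map_or(false, |t| self.0.first_set().contains(t))
        {
            items.push(self.0.parse(input)?);
        }
        Some(items)
    }

    fn first_set(&self) -> Vec<Token> {
        self.0.first_set()
    }
}
```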

No idea if any of this would actually work, but if it does, we would get a parser combinator library with all the benefits of a parser generator.

Q&A

If this is so great, why haven't people done it yet?

They probably have. I haven't looked into it.

Also, I think it hasn't been done because it's probably pretty hard to make this generic. I think the only reason it might work here is that we're able to specialize the code to our representation of tokens. Making this a general library might be hard. Being able to play around with specialized things like this is part of the reason I'm doing this project.

sunjay commented 4 years ago

Ideas to look into: