sunjay opened this issue 4 years ago
After #58, we added parser combinators and implemented a simple MVP parser using them. An issue that came up is that it's really difficult to tell where to produce an error and where to discard one. Parser combinators, as implemented today, just don't encode enough information to make that decision. For example, consider the `many0` and `opt` parsers.
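To make the problem concrete, here is a minimal sketch of a `many0`-style combinator (not the project's actual API; the token-stream and result types are assumptions for illustration). Notice that the only way `many0` can stop repeating is by treating *any* failure of its inner parser as "end of the list":

```rust
// Hypothetical minimal setup: a parser takes a slice of string tokens
// and either returns the remaining input plus a value, or an error.
type Tokens<'a> = &'a [&'a str];
type PResult<'a, T> = Result<(Tokens<'a>, T), String>;

// Parse one specific token.
fn tok<'a>(expected: &'static str) -> impl Fn(Tokens<'a>) -> PResult<'a, &'a str> {
    move |input| match input.split_first() {
        Some((&t, rest)) if t == expected => Ok((rest, t)),
        _ => Err(format!("expected `{}`", expected)),
    }
}

// `many0` applies `p` zero or more times. The moment `p` fails, it
// stops and treats the failure as the end of the repetition -- it has
// no way to tell a harmless "no more items" apart from a genuine
// syntax error partway through an item.
fn many0<'a, T>(
    p: impl Fn(Tokens<'a>) -> PResult<'a, T>,
) -> impl Fn(Tokens<'a>) -> PResult<'a, Vec<T>> {
    move |mut input| {
        let mut items = Vec::new();
        while let Ok((rest, item)) = p(input) {
            input = rest;
            items.push(item);
        }
        Ok((input, items))
    }
}
```

Running `many0(tok("a"))` over `["a", "a", "b"]` succeeds with two items and silently discards the error on `"b"` -- which is exactly what you want at the end of a list, and exactly what you don't want when `"b"` is a typo inside an item.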
We had to hand-roll a few parsers because of the limitations around error handling. Specifically, `block` and `cond` both need at least two tokens of look-ahead to figure out whether they should continue or not. `block` actually needs some very specialized handling because it can't tell whether it has a statement or the final expression until the end of an expression (which might be any number of tokens). Parser generators that build parse tables have no such issue: they just parse the expression, then check for a semicolon to determine whether it was a statement or an expression. The classification is left until the end, not encoded in the control flow at the start. For `cond`, the situation is simpler because we just need two tokens of look-ahead: if we see `else if`, we parse the condition and block; otherwise, if we see `else {`, we just parse the block.
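The "classify at the end" strategy can be sketched as follows (a toy model, not the project's grammar: an "expression" here is just a single token, and `;`/`}` are the only delimiters). The point is that the statement-vs-final-expression decision happens *after* the expression is fully parsed, by peeking at one token:

```rust
// Each item in a block body is either a statement (expression
// followed by `;`) or the block's trailing final expression.
#[derive(Debug, PartialEq)]
enum Item {
    Statement(String),
    FinalExpr(String),
}

// Toy expression parser: one token that isn't a delimiter.
fn parse_expr<'a>(input: &'a [&'a str]) -> Option<(&'a [&'a str], String)> {
    match input.split_first() {
        Some((&t, rest)) if t != ";" && t != "}" => Some((rest, t.to_string())),
        _ => None,
    }
}

// Parse the items of a block body. The classification is deferred:
// parse a whole expression first, then check for a `;` to decide
// what it was, instead of committing in the control flow up front.
fn parse_block_body<'a>(mut input: &'a [&'a str]) -> Vec<Item> {
    let mut items = Vec::new();
    while let Some((rest, expr)) = parse_expr(input) {
        if rest.first() == Some(&";") {
            items.push(Item::Statement(expr));
            input = &rest[1..];
        } else {
            // No semicolon: this was the block's final expression.
            items.push(Item::FinalExpr(expr));
            break;
        }
    }
    items
}
```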
The `opt` combinator is fairly heavy-handed. It never returns an error: either it completely succeeds and returns `Some(...)`, or it completely fails and returns `None`. This was necessary because of the same limitations that made `many0` unusable in some cases. We can't encode the number of tokens that are allowed to fail (i.e. the "look-ahead"), so we have to define `opt` like this in order for it to be useful at all. If `opt` returned an error in the same situations that `many0` does, the combinator would become unusable in a bunch of different places.
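The all-or-nothing behavior looks like this in a minimal sketch (again using assumed token/result types, not the project's actual ones). Every error from the inner parser is swallowed, no matter how far into a partial parse it occurred:

```rust
type Tokens<'a> = &'a [&'a str];
type PResult<'a, T> = Result<(Tokens<'a>, T), String>;

fn tok<'a>(expected: &'static str) -> impl Fn(Tokens<'a>) -> PResult<'a, &'a str> {
    move |input| match input.split_first() {
        Some((&t, rest)) if t == expected => Ok((rest, t)),
        _ => Err(format!("expected `{}`", expected)),
    }
}

// `opt` maps any failure of the inner parser to `None`, consuming no
// input. The error is discarded entirely, so the caller can never
// tell a harmless absence apart from a broken partial parse.
fn opt<'a, T>(
    p: impl Fn(Tokens<'a>) -> PResult<'a, T>,
) -> impl Fn(Tokens<'a>) -> PResult<'a, Option<T>> {
    move |input| match p(input) {
        Ok((rest, value)) => Ok((rest, Some(value))),
        Err(_) => Ok((input, None)),
    }
}
```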
The `nom` crate tries to solve this with its `cut` combinator, which wraps any parser and makes it produce a "fatal" error. `nom`'s error type has two variants: `Error` and `Failure`. `Error` is a recoverable error, and `Failure` is an unrecoverable one. The idea is that if a parser sees `Failure`, it's supposed to always return it. This is how `nom`'s `many0` and `opt` combinators distinguish cases that are fine from cases that should produce an error.

I don't think the `cut` approach really works in general. For one thing, it requires you to manually annotate the cases that are fatal errors, which is error-prone and easy to mess up or forget. Another problem is that whether an error is "unrecoverable" is highly context-dependent. For example, when parsing `if`, you know that the condition and block are required, so should their absence be an unrecoverable error? No, because if it were, other combinators like `many0` and `alt` would always return that error, even if another combinator could succeed in that place. If you really think about it, it's very hard to put `cut` in a place where it's guaranteed to always work the way you expect.
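The failure mode is easy to demonstrate with a `nom`-style sketch (these are not `nom`'s actual types, just a minimal model of the recoverable/fatal split). A `cut` inside one branch of `alt` prevents the other branch from ever being tried:

```rust
// Two error severities, mimicking nom's Error/Failure distinction.
#[derive(Debug, PartialEq)]
enum PErr {
    Recoverable(String),
    Fatal(String),
}

type Tokens<'a> = &'a [&'a str];
type PResult<'a, T> = Result<(Tokens<'a>, T), PErr>;

fn tok<'a>(expected: &'static str) -> impl Fn(Tokens<'a>) -> PResult<'a, &'a str> {
    move |input| match input.split_first() {
        Some((&t, rest)) if t == expected => Ok((rest, t)),
        _ => Err(PErr::Recoverable(format!("expected `{}`", expected))),
    }
}

// `cut` upgrades any recoverable error from `p` into a fatal one.
fn cut<'a, T>(
    p: impl Fn(Tokens<'a>) -> PResult<'a, T>,
) -> impl Fn(Tokens<'a>) -> PResult<'a, T> {
    move |input| p(input).map_err(|e| match e {
        PErr::Recoverable(msg) => PErr::Fatal(msg),
        fatal => fatal,
    })
}

// `alt` tries `a`, falling back to `b` only on a recoverable error.
// A fatal error from `a` is returned immediately -- which is exactly
// why a misplaced `cut` can hide a parse that `b` would have accepted.
fn alt<'a, T>(
    a: impl Fn(Tokens<'a>) -> PResult<'a, T>,
    b: impl Fn(Tokens<'a>) -> PResult<'a, T>,
) -> impl Fn(Tokens<'a>) -> PResult<'a, T> {
    move |input| match a(input) {
        Err(PErr::Recoverable(_)) => b(input),
        other => other,
    }
}
```

With input `["b"]`, `alt(tok("a"), tok("b"))` succeeds, but `alt(cut(tok("a")), tok("b"))` reports a fatal error even though the second branch would have matched: whether the `cut` was correct depended entirely on the context the parser was used in.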
This is one of the "fatal flaws" of parser combinators. I think they just don't encode enough information about what cases can be recovered from in which situations. In a way, I think it's because we're encoding the wrong information. It's the difference between imperative programming and declarative programming. With combinators, we're imperatively telling the computer to test various parses. With a more declarative style, we could encode the information we want to test, and do more clever things with look-ahead/recovery at runtime.
This point might be counter-intuitive. Parser combinators look very declarative! Indeed, I think the imperative nature of combinators has more to do with our implementation of them than with the way we write them per se. It might be interesting to experiment with versions of the combinators that return structs instead of functions. For example, the `many0` combinator could return a `Many0<P>` struct that is generic over its inner parser. If we can turn the information encoded by the combinators into data, maybe we can process that data and essentially create a parser generator written in combinator style. With that additional data, we could potentially generate parse tables and do error recovery automatically.
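A rough sketch of what combinators-as-data might look like (hypothetical design, not an existing implementation): a `Parser` trait whose implementors are plain structs, so the whole grammar is an inspectable value. As a stand-in for real analysis like look-ahead computation, the example includes a trivial `describe` pass over the combinator tree:

```rust
type Tokens<'a> = &'a [&'a str];

trait Parser {
    type Output;
    fn parse<'a>(&self, input: Tokens<'a>) -> Option<(Tokens<'a>, Self::Output)>;
    // A tiny example of processing the combinator *data* rather than
    // running it: render the grammar fragment this parser accepts.
    fn describe(&self) -> String;
}

// Leaf parser: a single expected token.
struct Tok(&'static str);

impl Parser for Tok {
    type Output = String;
    fn parse<'a>(&self, input: Tokens<'a>) -> Option<(Tokens<'a>, String)> {
        match input.split_first() {
            Some((&t, rest)) if t == self.0 => Some((rest, t.to_string())),
            _ => None,
        }
    }
    fn describe(&self) -> String {
        format!("`{}`", self.0)
    }
}

// `many0` as a struct generic over its inner parser, instead of a
// closure. The structure of the grammar survives into the type.
struct Many0<P>(P);

impl<P: Parser> Parser for Many0<P> {
    type Output = Vec<P::Output>;
    fn parse<'a>(&self, mut input: Tokens<'a>) -> Option<(Tokens<'a>, Self::Output)> {
        let mut items = Vec::new();
        while let Some((rest, item)) = self.0.parse(input) {
            input = rest;
            items.push(item);
        }
        Some((input, items))
    }
    fn describe(&self) -> String {
        format!("({})*", self.0.describe())
    }
}
```

Because `Many0(Tok("a"))` is data, a pre-processing pass could in principle walk the same structure to compute first/follow sets or build parse tables, rather than only being able to execute it.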
No idea if any of this would actually work, but if it does, we would get a parser combinator library with all the benefits of a parser generator.
If this is so great, why haven't people done it yet?
They probably have. I haven't looked into it.
Also, I think it hasn't been done because it's probably pretty hard to make this generic. I think the only reason this might work here is because we're able to specialize the code to our representation of tokens. Making this a general library might be hard. Being able to play around with specialized things like this is part of the whole reason why I'm doing this project.
Ideas to look into: