Closed dpren closed 6 years ago
Hey @dpren, it'd be great to have another set of 👀 on this parser. Glad you're trying it out!
Could you explain the behavior you expect in a bit more detail? A better error recovery?
It looks like right now, the parser can't detect an error until the end of file, because it sees the second =
token as a valid variable_operator
.
@rewinfrey Is it correct that a single =
can be used as an operator? I think we might want to restrict the definition of variable_symbol
so that this won't happen.
If I hack that change in, then the parser can detect the error a bit earlier, at the second =
instead of at the EOF. We then get a better recovery:
(module [0, 0] - [3, 0]
(function_declaration [0, 0] - [2, 9]
(variable_identifier [0, 0] - [0, 5])
(ERROR [0, 6] - [2, 5]
(function_application [0, 9] - [2, 5]
(parenthesized_expression [0, 9] - [0, 12]
(integer [0, 10] - [0, 11]))
(variable_identifier [2, 0] - [2, 5])))
(function_body [2, 6] - [2, 9]
(integer [2, 8] - [2, 9]))))
Basically, we treat the = ((1) test
as an error and parse the remainder as a single declaration. That's now a pretty reasonable recovery, seeing as how the error cannot theoretically be detected until the second =
.
@dpren For some context on Tree-sitter's error recovery algorithm - It currently works by skipping one or more tokens and inserting at most one token, in some sequence that overlaps or is adjacent to the error detection point. Tree-sitter tries several ways of recovering from the error and chooses the one that ends up minimizing a pre-defined 'cost' metric.
Because it uses LR parsing, Tree-sitter always detects an error at the earliest possible point in the file, which in this case is the second =
sign. So it can only perform sequences of skipping / insertion operations that touch that token. That's why it can't simply skip the redundant open-paren or insert a closing paren to match it: those operations are not adjacent to the point of error detection, so they would be prohibitively expensive to consider.
@maxbrunsfeld Thanks for the detailed reply! I'm looking for better recovery. For context, I'm considering tree-sitter for an IDE focused parser for the Elm language.
This last section "Recovery" of this HN comment sums up what I have in mind: https://news.ycombinator.com/item?id=13918175
Basically, when there's an error - use indentation level as a heuristic to resume at the next declaration.
I'm guessing tree-sitter can't support this sort of thing? I could always re-parse sections manually, but at that point I wonder if I should just be using a different tool or going handwritten.
Basically, when there's an error - use indentation level as a heuristic to resume at the next declaration.
Yeah, Tree-sitter doesn't do this. I don't think that would be a helpful approach in this particular case anyway, because the error would not be detected until the middle of the second declaration, unless I'm missing something.
I understand this is WIP - but I noticed missing closing parens cause problems. Is this something that needs work in the scanner? I'd be willing to look into it.