tree-sitter / tree-sitter-haskell

Haskell grammar for tree-sitter.
MIT License
157 stars 36 forks source link

unmatched parentheses #14

Closed dpren closed 6 years ago

dpren commented 6 years ago

I understand this is WIP - but I noticed missing closing parens cause problems. Is this something that needs work in the scanner? I'd be willing to look into it.

test = ((1)

test = 2
(module [0, 0] - [3, 0]
  (ERROR [0, 0] - [2, 8]
    (variable_identifier [0, 0] - [0, 4])
    (infix_operator_application [0, 8] - [2, 8]
      (function_application [0, 8] - [2, 4]
        (parenthesized_expression [0, 8] - [0, 11]
          (integer [0, 9] - [0, 10]))
        (variable_identifier [2, 0] - [2, 4]))
      (variable_operator [2, 5] - [2, 6]
        (variable_symbol [2, 5] - [2, 6]))
      (integer [2, 7] - [2, 8]))))
maxbrunsfeld commented 6 years ago

Hey @dpren, it'd be great to have another set of 👀 on this parser. Glad you're trying it out!

Could you explain the behavior you expect in a bit more detail? A better error recovery?

It looks like right now, the parser can't detect an error until the end of file, because it sees the second = token as a valid variable_operator.

@rewinfrey Is it correct that a single = can be used as an operator? I think we might want to restrict the definition of variable_symbol so that this won't happen.

If I hack that change in, then the parser can detect the error a bit earlier, at the second = instead of at the EOF. We then get a better recovery:

(module [0, 0] - [3, 0]
  (function_declaration [0, 0] - [2, 9]
    (variable_identifier [0, 0] - [0, 5])
    (ERROR [0, 6] - [2, 5]
      (function_application [0, 9] - [2, 5]
        (parenthesized_expression [0, 9] - [0, 12]
          (integer [0, 10] - [0, 11]))
        (variable_identifier [2, 0] - [2, 5])))
    (function_body [2, 6] - [2, 9]
      (integer [2, 8] - [2, 9]))))

Basically, we treat the = ((1) test as an error and parse the remainder as a single declaration. That's now a pretty reasonable recovery, seeing as how the error cannot theoretically be detected until the second =.

maxbrunsfeld commented 6 years ago

@dpren For some context on Tree-sitter's error recovery algorithm - It currently works by skipping one or more tokens and inserting at most one token, in some sequence that overlaps or is adjacent to the error detection point. Tree-sitter tries several ways of recovering from the error and chooses the one that ends up minimizing a pre-defined 'cost' metric.

Because it uses LR parsing, Tree-sitter always detects an error at the earliest possible point in the file, which in this case is the second = sign. So it can only perform sequences of skipping / insertion operations that touch that token. That's why it can't simply skip the redundant open-paren or insert a closing paren to match it: those operations are not adjacent to the point of error detection, so they would be prohibitively expensive to consider.

dpren commented 6 years ago

@maxbrunsfeld Thanks for the detailed reply! I'm looking for better recovery. For context, I'm considering tree-sitter for an IDE focused parser for the Elm language.

This last section "Recovery" of this HN comment sums up what I have in mind: https://news.ycombinator.com/item?id=13918175

Basically, when there's an error - use indentation level as a heuristic to resume at the next declaration.

I'm guessing tree-sitter can't support this sort of thing? I could always re-parse sections manually, but at that point I wonder if I should just be using a different tool or going handwritten.

maxbrunsfeld commented 6 years ago

Basically, when there's an error - use indentation level as a heuristic to resume at the next declaration.

Yeah, Tree-sitter doesn't do this. I don't think that would be a helpful approach in this particular case anyway, because the error would not be detected until the middle of the second declaration, unless I'm missing something.