ionide / tree-sitter-fsharp

F# grammar for tree-sitter

Error while parsing type definition #72

Closed by KaranAhlawat 1 month ago

KaranAhlawat commented 1 month ago

Describe the bug
Incorrect parsing of a type definition when preceded by an application expression in a module

To Reproduce

module A = 
    printfn "Hi"

    type B = | C | D

Using this code snippet, the tree produced is:

(file
 (module_defn module (identifier) =
  block: 
   (sequential_expression
    (application_expression
     (long_identifier_or_op (identifier))
     (const
      (string " ")))
    ;
    (infix_expression
     (application_expression
      (application_expression
       (long_identifier_or_op (identifier))
       (long_identifier_or_op (identifier)))
      (ERROR = |)
      (long_identifier_or_op (identifier)))
     (infix_op)
     (long_identifier_or_op
      (long_identifier (identifier)))))))

Expected behaviour
Expected the type to be parsed normally, which would be something like this:

(file
 (module_defn module (identifier) =
  block: 
   (type_definition type
    (union_type_defn
     (type_name type_name: (identifier))
     =
     block: 
      (union_type_cases |
       (union_type_case (identifier))
       |
       (union_type_case (identifier)))))))

Screenshots
With the application expression: (screenshot)

Without the application expression: (screenshot)


KaranAhlawat commented 1 month ago

If it's technically feasible, I'd like to contribute a fix, but I really have no clue where to start looking or how the debugging process works.

Nsidorenco commented 1 month ago

Hi! Thank you for the nicely detailed error report, and for your interest in fixing it!

The best bet for debugging the parser is the npm run debug $FILE command, which prints every step the parser takes.

Using it effectively, however, requires an overview of the parsing process.

Here, it is important to know that there are two steps to the parsing process: the external and the internal. The external scanner is defined in src/scanner.c and is a hand-crafted parser written in C. The external scanner always runs first. It reads one character at a time and decides whether to continue reading, return a token, or give up. If the external scanner does not return a token, the internal scanner runs. The internal scanner is the parser generated from the grammar.js file.
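To make that split concrete, here is a minimal, illustrative grammar.js skeleton. Everything except _newline is a placeholder, not the real tree-sitter-fsharp definition; it only shows where external tokens are declared:

// Illustrative skeleton only, not the actual tree-sitter-fsharp grammar.
// Tokens listed under `externals` are matched by the hand-written C
// scanner in src/scanner.c; every other rule is compiled into the
// generated (internal) scanner. Building this would still require a
// scanner.c that emits the external token.
module.exports = grammar({
  name: "fsharp_sketch",

  externals: $ => [
    $._newline, // emitted by the C scanner while tracking indentation
  ],

  rules: {
    source_file: $ => repeat($._expression),
    _expression: $ => $.identifier, // placeholder rule
    identifier: $ => /[a-z]+/,
  },
});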

If we run npm run debug on the snippet you provided, we will at some point see the following in the output:

process version:0, version_count:1, state:1541, row:1, col:16
lex_external state:4, row:1, column:16
  skip character:10
  skip character:10
  skip character:' '
  skip character:' '
  skip character:' '
  skip character:' '
  consume character:'t'
lexed_lookahead sym:_newline, size:0
reduce sym:_string_literal, child_count:3
reduce sym:string, child_count:1
reduce sym:const, child_count:1
reduce sym:_expression, child_count:1
reduce sym:application_expression, child_count:2
reduce sym:_expression, child_count:1
shift state:640
process version:0, version_count:1, state:640, row:1, col:16
lex_external state:2, row:1, column:16
  skip character:10
  skip character:10
  skip character:' '
  skip character:' '
  skip character:' '
  skip character:' '
lex_internal state:347, row:1, column:16
  skip character:10
  skip character:10
  skip character:' '
  skip character:' '
  skip character:' '
  skip character:' '
  consume character:'t'
  consume character:'y'
  consume character:'p'
  consume character:'e'
  consume character:'t'
  consume character:'y'
  consume character:'p'
  consume character:'e'
lexed_lookahead sym:identifier, size:10

These are the parsing steps immediately following printfn "Hi", where we can tell from:

reduce sym:_string_literal, child_count:3
reduce sym:string, child_count:1
reduce sym:const, child_count:1
reduce sym:_expression, child_count:1
reduce sym:application_expression, child_count:2
reduce sym:_expression, child_count:1

that we correctly parsed printfn "Hi" into an application_expression.

We can then tell from the first lines:

lex_external state:4, row:1, column:16
  skip character:10
  skip character:10
  skip character:' '
  skip character:' '
  skip character:' '
  skip character:' '
  consume character:'t'
lexed_lookahead sym:_newline, size:0

that the external scanner consumed the newline characters and spaces up to the t of the type keyword, and that this resulted in a _newline token.

The _newline token is used by the sequential_expression in the grammar to check that two expressions start at the same level of indentation.
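As a sketch of what that looks like in the grammar DSL (a hypothetical shape only; the real definition in grammar.js differs in detail):

// Hypothetical shape of the rule, living inside the `rules` object of
// grammar.js. Each expression after the first must be preceded by a
// _newline token, which the external scanner only emits when the next
// expression starts at the same indentation level.
sequential_expression: $ =>
  prec.right(
    seq(
      $._expression,
      repeat1(seq($._newline, $._expression)),
    ),
  ),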

What we really want to happen at this point is for type to be parsed as the start of a type_definition.

Both sequential_expression and type_definition are valid children of the module_elem node, which is the top-level node containing all elements of a module.

However, since the external scanner produced a _newline token, and neither module_elem nor type_definition accepts a _newline token, the parser is forced to try to parse the rest of the file as the child node of a sequential_expression.
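Schematically (again a hypothetical sketch, not the real rule), the situation looks like this:

// Hypothetical sketch: the module body offers both alternatives, but
// neither alternative itself starts with a _newline token.
_module_elem: $ =>
  choice(
    $.type_definition,
    $._expression, // sequential_expression is reached through here
  ),
// Once the scanner has already emitted _newline, the only rule that can
// consume it is sequential_expression, so the parser commits to that
// branch and the `type` keyword ends up lexed as a plain identifier,
// as seen in the debug output above.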

My idea would be to handle _newline tokens in the module definition akin to sequential_expression, i.e. every sub-node but the first should have a _newline token before it. There will then probably be precedence issues between _module_elem and sequential_expression that would have to be fixed to make it all correct.
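As an untested sketch of that idea (the module_defn shape below is assumed for illustration, not copied from the grammar):

// Untested sketch of the proposed fix: let the module body consume the
// _newline between its elements itself, mirroring sequential_expression,
// so a _newline after an element no longer forces the parser into the
// sequential_expression branch.
module_defn: $ =>
  seq(
    "module",
    $.identifier,
    "=",
    field("block", seq(
      $._module_elem,
      repeat(seq($._newline, $._module_elem)),
    )),
  ),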