ikatyang / tree-sitter-yaml

YAML grammar for tree-sitter
https://ikatyang.github.io/tree-sitter-yaml/
MIT License
94 stars 38 forks source link

Errors #15

Open char0n opened 3 years ago

char0n commented 3 years ago

Hi @ikatyang,

I've been playing with how Errors and tree-sitter recovery mechanism work in this grammar and found some interesting cases that I'm not sure how to handle.

---
openapi: 3.1.0
x-top-level: value
info:
   title: Sample API
  unknownFixedField: value
  description: Optional multiline or single-line description in [CommonMark](http://commonmark.org/help/)
    or HTML.
  summary: example summary
  termsOfService: Terms of service
  version::: 0.1.9
  x-version: 0.1.9-beta
  license:
    name: Apache License 2.0
    x-fullName: Apache License 2.0
    identifier::: Apache License 2.0
    url: https://www.apache.org/licenses/LICENSE-2.0 

produces

ERROR [0, 0] - [4, 20])
  block_mapping_pair [1, 0] - [1, 14])
    key: flow_node [1, 0] - [1, 7])
      plain_scalar [1, 0] - [1, 7])
        string_scalar [1, 0] - [1, 7])
    value: flow_node [1, 9] - [1, 14])
      plain_scalar [1, 9] - [1, 14])
        string_scalar [1, 9] - [1, 14])
  block_mapping_pair [2, 0] - [2, 18])
    key: flow_node [2, 0] - [2, 11])
      plain_scalar [2, 0] - [2, 11])
        string_scalar [2, 0] - [2, 11])
    value: flow_node [2, 13] - [2, 18])
      plain_scalar [2, 13] - [2, 18])
        string_scalar [2, 13] - [2, 18])
  flow_node [3, 0] - [3, 4])
    plain_scalar [3, 0] - [3, 4])
      string_scalar [3, 0] - [3, 4])
  block_mapping_pair [4, 3] - [4, 20])
    key: flow_node [4, 3] - [4, 8])
      plain_scalar [4, 3] - [4, 8])
        string_scalar [4, 3] - [4, 8])
    value: flow_node [4, 10] - [4, 20])
      plain_scalar [4, 10] - [4, 20])
        string_scalar [4, 10] - [4, 20])

What is happening here is that parser doesn't parse after title: Sample API. This makes it pretty unusable in for example Editor cases. Only part of the source string is parsed and the rest is either ignored or consumed by parser. I would expect CST to contain at least another Error object with additional children nodes.

Did you ever bumped into this issue?

Thanks for any answer

char0n commented 3 years ago

Here is an example how tree-sitter-json produces it's CST.

Given

{a*

The resulting CST would be:

   Error
     Literal - {
     Error - a*

As we can see tree-sitter-json parses entire source string and creates CST of Errors and Literals covering entire source string.

ikatyang commented 3 years ago

The issue here is that I did not use the internal lexer, and I only produces valid tokens, so that there is no token produced if it encountered invalid state. And to increase the parsing speed, I intentionally wrote the lexer with avoiding forking the tree in mind, which in practice combines multiple tokens into ones, that's the reason why error recovery does not work here. I guess this issue can be fixed by producing tokens more granularly and loosely though I'm not sure what the performance impact will be, I'll see what I can do.

char0n commented 3 years ago

@ikatyang thanks for describing the problem. That explains a lot.

Error recovery and parsing the entire source string are two most important things that we were looking for with tree-sitter. I do have couple of additional questions, if you don't mind

1. Is there any chance this grammar will choose to use internal tree sitter lexer in future? 2. How complicated it would be to fork this grammar and use internal tree sitter lexer?

I'm not sure what the performance impact will be

I think people use tree-sitter for two reasons - unmatched error recovery ability and performance. Unfortunately this grammar is killing error recovery in favor of performance.

I'll see what I can do.

Thank you very much. We already have a very ugly workaround for this by creating surrogate ERROR node with unparsed source string. If in the first iteration we can make this grammar parse entire string it would be a grand start.

ikatyang commented 3 years ago
  1. Is there any chance this grammar will choose to use internal tree sitter lexer in future?

The external lexer must be involved since there some tokens that are not possible to be described by the grammar.js, e.g., the conceptual token (i.e., indent/dedent/etc.) and the lookahead token (i.e., (?=x)/(?!x) in regex). The only question is the strategy of using internal/external lexer, this repo is currently using 100% external lexer because of the better controllability, but it does not seem to be necessary after second thought, I'll probably switch to only use external lexer if necessary to take advantage of features provided by the internal lexer, i.e., better error recovery and machine-generated tokenizer.

  1. How complicated it would be to fork this grammar and use internal tree sitter lexer?

It'd be basically a new project with only reusable test cases IMO.

Unfortunately thing grammar is killing error recovery in favor of performance.

Sorry, I forgot to confirm if it works well in invalid state, I only checked if it shows errors in error state since I originally want to use it as an AST parser 😅.

If in the first iteration we can make this grammar parse entire string it would be a grand start.

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

char0n commented 3 years ago

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

No worries, we already have a quick & dirty workaround (creating surrogate errors). No need for another one. We'll wait for some systematic solution on master branch.

char0n commented 3 years ago

Hi @ikatyang,

Just pinging if there is any plan for this issue in some foreseeable feature

ikatyang commented 3 years ago

It's still on my TODO list but I do not have time to work on it recently so I cannot give you an ETA for it.