swagger-api / apidom

Semantic parser for API specifications
https://swagger-api.github.io/apidom/
69 stars 18 forks source link

Make error recovery in YAML adapters better #194

Open char0n opened 3 years ago

char0n commented 3 years ago
---
openapi: 3.1.0
x-top-level: value
info:
  title: Sample API
  unknownFixedField: value
  description: Optional multiline or single-line description in [CommonMark](http://commonmark.org/help/)
    or HTML.
  summary: example summary
  termsOfService: Terms of service
  version::: 0.1.9
  x-version: 0.1.9-beta
  license:
    name: Apache License 2.0
    x-fullName: Apache License 2.0
    identifier::: Apache License 2.0
      url: https://www.apache.org/licenses/LICENSE-2.0

Following YAML fragment produces ERROR node with children of valid map, sequence or scalar.

image

Ref https://github.com/tree-sitter/tree-sitter/issues/2339 Ref https://github.com/ikatyang/tree-sitter-yaml/issues/50 Ref https://github.com/swagger-api/swagger-editor/issues/4217 Ref https://github.com/tree-sitter-grammars/tree-sitter-yaml/issues/14

char0n commented 3 years ago

We first need to be clear about https://github.com/ikatyang/tree-sitter-yaml/issues/15 (replaced by https://github.com/tree-sitter-grammars/tree-sitter-yaml/issues/14) before we proceed any further.

frantuma commented 2 years ago

@char0n

We are at a point where we'd need if possible to work around this the best way we can, I recap here the thing and the possible discussed workarounds, hoping we can at least mitigate the issue.

Some references

and specifically this comment

Recap from my notes:

The issue to recap:

e.g:

asyncapi: '2.2.0'
info:
  tit
  version: '1.0.0'

(this typically happens while editing)

passing the above to the parser, results currently in just an error with no nodes (both with and without the empty plugin):

ParseResult {
  _content: [
    Annotation {
      _content: "(Error asyncapi: '2.2.0'\ninfo:\n  tit\n  version)",
      _storedElement: 'annotation',
      _meta: [ObjectElement]
    }
  ],
  _storedElement: 'parseResult'
}

a "similar" JSON with error results instead in a "good" parse result with error annotations but with the tree populated.

Now looking at the tree-sitter YAML parse result, it seems that actually it provides a parse tree (for what it manages, i.e. up to the location of the error), but "wrapped" in an ERROR node:

  ERROR [0:0-3:9]
   block_mapping_pair [0:0-0:17]
    flow_node [0:0-0:8]
     plain_scalar [0:0-0:8]
      string_scalar [0:0-0:8]
    : [0:8-0:9]
    flow_node [0:10-0:17]
     single_quote_scalar [0:10-0:17]
      ' [0:10-0:11]
      ' [0:16-0:17]
   block_mapping_pair [1:0-3:9]
    flow_node [1:0-1:4]
     plain_scalar [1:0-1:4]
      string_scalar [1:0-1:4]
    : [1:4-1:5]
    flow_node [2:2-3:9]
     plain_scalar [2:2-3:9]
      string_scalar [2:2-3:9]

while I guess CstVisitor "stops" when encountering the ERROR node: https://github.com/swagger-api/apidom/blob/2984f59f5e7cb1f2693883766051e7b0c45f94b4/packages/apidom-parser-adapter-yaml-1-2/src/syntactic-analysis/visitors/CstVisitor.ts#L614-L629

We like to get whatever we can instead of nothing

Possible workarounds

1. creating surrogate document from top level Error

This seems possibly the easiest way to go without relying on intervention by tree-sitter-yaml maintainer (or our intervention on that code). We would not get the entire tree, but at least we would get a partial result, which would suffice to overcome at least our issues in completion scenarios.

We discussed it and I guess we originally agree that it can be implemented, something like replacing the top level ERROR node with "surrogate document" so something like:

stream [0:0-3:0]
 document [0:0-3:0]
  block_node [0:0-3:0]
   block_mapping [0:0-3:0]

You have elaborated about this in this comment in slack and related replies, and I am not sure I get the conclusion.

In this comment you say:

Regarding way of creating surrogate document from top level Error: IMHO we have to revise our plan as that CST contains gibberish: keyValuePairs along with flow nodes. So in order to create something that is valid YAML AST I will only take keyValuePairs and create a surrogate YamlMapping from it, but all the flow nodes are going to be lost and not addressable in any way via ApiDOM.

I am not sure I understand your point, what would be exactly the problem, and would it stop us from obtaining a parse tree up to the point of the error?

I see the parsed tree structure stays the same (for the parsed part) as the result of a successful parsing of good YAML, e.g. parsing:

asyncapi: '2.2.0'
info:
  tittle: aa
  version: '1.0.0'

produces:

stream [0:0-4:0]
 document [0:0-4:0]
  block_node [0:0-4:0]
   block_mapping [0:0-4:0]
    block_mapping_pair [0:0-0:17]
     flow_node [0:0-0:8]
      plain_scalar [0:0-0:8]
       string_scalar [0:0-0:8]
     : [0:8-0:9]
     flow_node [0:10-0:17]
      single_quote_scalar [0:10-0:17]
       ' [0:10-0:11]
       ' [0:16-0:17]
    block_mapping_pair [1:0-4:0]
     flow_node [1:0-1:4]
      plain_scalar [1:0-1:4]
       string_scalar [1:0-1:4]
     : [1:4-1:5]
     block_node [2:2-4:0]
      block_mapping [2:2-4:0]
       block_mapping_pair [2:2-2:11]
        flow_node [2:2-2:7]
         plain_scalar [2:2-2:7]
          string_scalar [2:2-2:7]
        : [2:7-2:8]
        flow_node [2:9-2:11]
         plain_scalar [2:9-2:11]
          string_scalar [2:9-2:11]
       block_mapping_pair [3:2-3:18]
        flow_node [3:2-3:9]
         plain_scalar [3:2-3:9]
          string_scalar [3:2-3:9]
        : [3:9-3:10]
        flow_node [3:11-3:18]
         single_quote_scalar [3:11-3:18]
          ' [3:11-3:12]
          ' [3:17-3:18]

So question here is if we can proceed with the plan.

You also added this comment in slack:

I probably have a solution for tree-sitter-yaml grammar not generating Errors for entire source string - during CST -> AST transformation I can create a surrogate error that will contain the rest of the source string that was ignored by the parser. Source maps will have to be computed for that and the resulting string will be represented as a value of an Error

If I get this correctly, it's about getting some result for the remaining part of the doc (after the error), but not sure what would be the result, can you elaborate a bit more on it?

2. relying on workaround to be implemented in YAML grammar

I am referring to this comment:

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

I guess this wouldn't solve the situation above (partial key with no :) but would much help in other scenarios. I think we could happily use that hack in some scenarios if maintainer could implement as he said, would you have any reason not to ask if can indeed add it?

Also not sure about your comment to his proposal:

No worries, we already have a quick & dirty workaround (creating surrogate errors). No need for another one. We'll wait for some systematic solution on master branch.

I believe this is related to what discussed above, but we don't have the workaround yet, correct?

char0n commented 2 years ago

Regarding way of creating surrogate document from top level Error: IMHO we have to revise our plan as that CST contains gibberish: keyValuePairs along with flow nodes. So in order to create something that is valid YAML AST I will only take keyValuePairs and create a surrogate YamlMapping from it, but all the flow nodes are going to be lost and not addressable in any way via ApiDOM.

As the grammar doesn't use internal tree-sitter lexer, in many cases it generates CST that is nonsensical. There are key value pairs without parent mappings (flow or block) etc... That is exactly what tree-sitter internal lexer is for - to probe paths of what author most probably intended with incorrect document and provide this most probable correct version CST tree in case of error. If we would try to create some smart algorithm to compensate for this, I'd rather fork the YAML grammar and reimplement it in proper way.

If you want to obtain the incorrect CST -> AST -> ApiDOM tree computed from the children of the top level error, this can be done by converting the top level Error to surrogate Document node and moving all Error node children as surrogate Document node children. This will result in ParseResultElement containing improperly constructed ApiDOM (having MemberElement directly as an item in ArrayElement without wrapping ObjectElement, etc...) and our tooling will most probably fail when operations are performed on ApiDOM like this.

We need to accept that tree-sitter-yaml is dead. The author doesn't have any interest in the grammar and the grammar it self is written in a way that causes all these issues we wouldn't need to handle, if the grammar was written in standard way, using internal mechanisms of tree-sitter.

I probably have a solution for tree-sitter-yaml grammar not generating Errors for entire source string - during CST -> AST transformation I can create a surrogate error that will contain the rest of the source string that was ignored by the parser. Source maps will have to be computed for that and the resulting string will be represented as a value of an Error

In some cases, the grammar stop parsing the source string and just consumes it. The solution here would be to create a surrogate Error that contains the consumed string as a child and compute the source maps. Then probably to represent this as ApiDOM we'd transform Error nodes like this into StringElement and Annotation(error=true).

A quick workaround would be to treat those invalid indent/dedent tokens as valid ones that make more sense in that position, though it'd result in treating invalid tokens as valid ones, is it OK for you? If so, I could create a PR for it, and you could build from that branch since I won't merge it as it's just a quick workaround with some downsides.

I have no idea what effect this will have on overall grammar behavior, it can possibly make the behavior of the grammar more tree-sitter like, but it's just theory... Author will not be doing this modification as he has no time and there has not been an answer from him since March 2021.

No worries, we already have a quick & dirty workaround (creating surrogate errors). No need for another one. We'll wait for some systematic solution on master branch.

This was addressing the issue with grammar consuming the rest of the source string. We can solve that on our side by workaround. I'm also suggesting here that making ad-hoc partial fixes is not a way to go, using internal tree-sitter lexer is...


Given all that's been discussed here and elsewhere, I think it's time to either produce our own grammar by forking https://github.com/ikatyang/tree-sitter-yaml or switch to different underlying parser for YAML. We're currently use dead YAML grammar, that's not giving us key benefits of tree-sitter and maintainer went completely silent.

Author/Maintainer also said this:

The external lexer must be involved since there some tokens that are not possible to be described by the grammar.js, e.g., the conceptual token (i.e., indent/dedent/etc.) and the lookahead token (i.e., (?=x)/(?!x) in regex). The only question is the strategy of using internal/external lexer, this repo is currently using 100% external lexer because of the better controllability, but it does not seem to be necessary after second thought, I'll probably switch to only use external lexer if necessary to take advantage of features provided by the internal lexer, i.e., better error recovery and machine-generated tokenizer.

So there are only two cases according to him where external lexer is needed. The rest can be handled by tree-sitter lexer. To achieve that we have to do mofidications in scanner.cc

frantuma commented 2 years ago

bumped on this one, moved to triaging not get it lost in new issues

ponelat commented 2 years ago

@frantuma @char0n Is it possible to at least enhance the error messaging around this? I have bumped into a few invalid YAML documents and have no idea how to fix them. Here is a contrived example...

asyncapi: 2.4.0
info: 
  version: '1.0.0'
   title: Something # Badly indented

The error message is the entire contents... (Error asyncapi: 2.4.0 info: version: '1.0.0')

How can we enhance this experience? Other YAML parsers will gives us the coordinates, so something like "Bad indentation on Line 4"

char0n commented 1 year ago

Here is another fragments that doesn't work in newest version web-tree-sitter@0.20.8 and tree-sitter-cli@0.20.8:

openapi: 3.1.0
info:
  summary: Update an existing pet
  desc
  title: test title
char0n commented 1 year ago

I've just issued https://github.com/swagger-api/apidom/pull/2881 to provide maximum error recovery for YAML parsing. For this I retained support for Node.js@18 and Node.js@20 (Node.js env), but used older versions of web-tree-sitter@0.20.3 and tree-sitter-cli@0.20.4 to generate WASM fragments (Browser env).

I also did deep research of tree-sitter-yaml. The grammar there is solid and time proven. Given that tree-sitter is built for context-free grammars and YAML is not context free grammar, the tree-sitter-yaml uses external scanner/lexer to provide compensation. Unfortunately this scanner becomes more and more incompatible with tree-sitter as it evolves.

What we need to do with tree-sitter-yaml is to use internal tree-sitter lexer/scanner for all the context-free constructs and only use external lexer/scanner for non context-free constructs. I've reached by email to the library author to see what we can do about it.

The external scanner/lexer is a classic very complex finite state machine, but any good C++ developer should be able to make changes there to compensate for tree-sitter evolution.

Following YAML fragments and error recovery are now supported:

asyncapi: 2.4.0
info: 
  version: '1.0.0'
   title: Something # Badly indented
openapi: 3.1.0
info:
  summary: Update an existing pet
  desc
  title: test title

Regarding the first example, we can provide better error message by inspect if the error message contains entire YAML fragment, which means - it's an YAML syntax error and not semantic linting error.

char0n commented 4 days ago

We first need to be clear about ikatyang/tree-sitter-yaml#15 before we proceed any further.

Replaced by https://github.com/tree-sitter-grammars/tree-sitter-yaml/issues/14

char0n commented 4 days ago

@frantuma @char0n Is it possible to at least enhance the error messaging around this? I have bumped into a few invalid YAML documents and have no idea how to fix them. Here is a contrived example...

asyncapi: 2.4.0
info: 
  version: '1.0.0'
   title: Something # Badly indented

The error message is the entire contents... (Error asyncapi: 2.4.0 info: version: '1.0.0')

How can we enhance this experience? Other YAML parsers will gives us the coordinates, so something like "Bad indentation on Line 4"

Yes @ponelat,

After successfully processing https://github.com/swagger-api/apidom/issues/4033 we're now getting the following results:

image

char0n commented 4 days ago

Here is another fragments that doesn't work in newest version web-tree-sitter@0.20.8 and tree-sitter-cli@0.20.8:

openapi: 3.1.0
info:
  summary: Update an existing pet
  desc
  title: test title

Works in latest version of YAML grammar after successfully processing https://github.com/swagger-api/apidom/issues/4033

image

char0n commented 4 days ago

I've just issued #2881 to provide maximum error recovery for YAML parsing. For this I retained support for Node.js@18 and Node.js@20 (Node.js env), but used older versions of web-tree-sitter@0.20.3 and tree-sitter-cli@0.20.4 to generate WASM fragments (Browser env).

I also did deep research of tree-sitter-yaml. The grammar there is solid and time proven. Given that tree-sitter is built for context-free grammars and YAML is not context free grammar, the tree-sitter-yaml uses external scanner/lexer to provide compensation. Unfortunately this scanner becomes more and more incompatible with tree-sitter as it evolves.

What we need to do with tree-sitter-yaml is to use internal tree-sitter lexer/scanner for all the context-free constructs and only use external lexer/scanner for non context-free constructs. I've reached by email to the library author to see what we can do about it.

The external scanner/lexer is a classic very complex finite state machine, but any good C++ developer should be able to make changes there to compensate for tree-sitter evolution.

Following YAML fragments and error recovery are now supported:

asyncapi: 2.4.0
info: 
  version: '1.0.0'
   title: Something # Badly indented
openapi: 3.1.0
info:
  summary: Update an existing pet
  desc
  title: test title

Regarding the first example, we can provide better error message by inspect if the error message contains entire YAML fragment, which means - it's an YAML syntax error and not semantic linting error.

After new YAML grammar has been successfully utilized, we've been able to update tree-sitter tool chain to latest versions and add support for Node.js 22 (23 respectively). Both of the examples described in above comment are now working without any issues.

char0n commented 4 days ago

Keeping this issue open as the original subject of this issue has not been resolved yet: https://github.com/tree-sitter-grammars/tree-sitter-yaml/issues/14