ikatyang / tree-sitter-yaml

YAML grammar for tree-sitter
https://ikatyang.github.io/tree-sitter-yaml/
MIT License

escape_sequence node is omitting information #45

Open jgomezb11 opened 1 year ago

jgomezb11 commented 1 year ago

When trying to force a keyword or number into a string as in the following example:

foo: "*"
foo2: '&'
number: '10'

The parsed result omits everything inside the double or single quotes, which means a loss of information compared to the initial file. Everything inside the quotes should be considered as a string and parsed as such, not omitted.

Or am I missing something?

char0n commented 1 year ago

Reproducing with https://ikatyang.github.io/tree-sitter-yaml/, your fixture parses into a correct CST:

foo: `double_quote_scalar`
foo2: `single_quote_scalar`
number: `single_quote_scalar`

jgomezb11 commented 1 year ago

You are right, it parses correctly... but it still omits information. Another example to make myself clear:

Using the same tool, https://ikatyang.github.io/tree-sitter-yaml/, if you try to parse:

foo: "\n"

It will generate a double_quote_scalar node that has a child named escape_sequence which refers to the information inside the quotes (in this case \n)...

That doesn't happen when trying to force a keyword into a string like my first example (foo: "*"). The parsed result has a double_quote_scalar without a child that refers to the content inside the quotes.

That's why I'm saying that the parser omits information.

char0n commented 1 year ago

> It will generate a double_quote_scalar node that has a child named escape_sequence which refers to the information inside the quotes (in this case \n)...

Exactly. The parser detected that the double_quote_scalar CST node has an escape_sequence child.

> That doesn't happen when trying to force a keyword into a string like my first example (foo: "*"). The parsed result has a double_quote_scalar without a child that refers to the content inside the quotes.

Because the double_quote_scalar CST node doesn't have an escape_sequence child, as the original source string doesn't contain escape sequences.

> That's why I'm saying that the parser omits information.

I don't see your point. It does not omit anything. It parses what it sees. If it encounters an escape sequence in a double-quoted scalar, it parses it; if it doesn't see any escape sequences in a double-quoted scalar, it does not produce any child CST nodes.

jgomezb11 commented 1 year ago

> If it encounters an escape sequence in a double-quoted scalar, it parses it; if it doesn't see any escape sequences in a double-quoted scalar, it does not produce any child CST nodes.

That's exactly the problem.

You are right that the escape_sequence node only matches occurrences of an escape sequence, but then there should be another type of node that matches the contents of the quotes when they contain no escape sequences; otherwise, it is as if the source string did not exist.

Another example

foo: "foo \n"

In this case there is a child node that points to \n, but there isn't a child node that refers to the first part of the string (foo), resulting in a loss of information.

Graphical representation of the example:

image

As you can see there is a child node that points to a newline but the rest of the source string is nowhere to be found.
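To make the gap concrete, here is a minimal TypeScript sketch that lists the parts of a parent node's text that no child covers. The `SpanNode` interface and the hand-written offsets are assumptions for illustration, not the API of any tree-sitter binding, and the offsets assume ASCII-only input:

```typescript
// Hypothetical minimal node shape (an assumption for this sketch). Real
// bindings expose similar start/end offsets, but check their docs.
interface SpanNode {
  type: string;
  startIndex: number; // inclusive offset into the source
  endIndex: number;   // exclusive offset into the source
  children: SpanNode[];
}

// Return the pieces of the parent's text that are not covered by any child.
function uncoveredText(source: string, node: SpanNode): string[] {
  const pieces: string[] = [];
  let cursor = node.startIndex;
  for (const child of node.children) {
    if (child.startIndex > cursor) {
      pieces.push(source.slice(cursor, child.startIndex));
    }
    cursor = Math.max(cursor, child.endIndex);
  }
  if (cursor < node.endIndex) {
    pieces.push(source.slice(cursor, node.endIndex));
  }
  return pieces;
}

// The YAML text `foo: "foo \n"` (backslash + n, not a real newline).
const source = 'foo: "foo \\n"';

// Offsets written by hand to mirror the tree described above.
const scalar: SpanNode = {
  type: "double_quote_scalar",
  startIndex: 5,
  endIndex: 13,
  children: [
    { type: "escape_sequence", startIndex: 10, endIndex: 12, children: [] },
  ],
};

const pieces = uncoveredText(source, scalar);
console.log(pieces); // → [ '"foo ', '"' ]
```

The uncovered pieces are still part of the parent node's own range, so the text is present in the tree; it just has no dedicated child node.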

char0n commented 1 year ago

Right, I understand what you're saying now.

I'm not an author of this library, but I use the grammar to build a syntactic analyzer on top of the CST that it produces. In the case of foo: "foo \n", I take the text of the double_quote_scalar node and run an unraw operation on it.

I don't care if the double_quote_scalar contains escape_sequence. Can't you just ignore escape_sequence as I'm doing?
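A rough sketch of that idea in TypeScript, assuming you already have the raw text of the double_quote_scalar node (quotes included). The function name `doubleQuotedValue` and the escape table are hypothetical; the table covers only a subset of YAML's double-quoted escapes, and multi-line folding is not handled:

```typescript
// Subset of YAML double-quoted escapes (not exhaustive; an assumption of
// this sketch, not the behaviour of the unraw package).
const SIMPLE_ESCAPES: Record<string, string> = {
  "0": "\0",
  a: "\x07",
  b: "\b",
  t: "\t",
  n: "\n",
  v: "\v",
  f: "\f",
  r: "\r",
  e: "\x1b",
  " ": " ",
  '"': '"',
  "/": "/",
  "\\": "\\",
};

// Take the full node text, e.g. `"foo \n"`, and return the string value.
function doubleQuotedValue(nodeText: string): string {
  // Drop the surrounding double quotes.
  const inner = nodeText.slice(1, -1);
  return inner.replace(
    /\\(?:x([0-9a-fA-F]{2})|u([0-9a-fA-F]{4})|U([0-9a-fA-F]{8})|([\s\S]))/g,
    (match: string, x?: string, u?: string, U?: string, simple?: string) => {
      const hex = x ?? u ?? U;
      if (hex !== undefined) {
        // Note: throws RangeError for code points above 0x10FFFF.
        return String.fromCodePoint(parseInt(hex, 16));
      }
      const mapped = simple !== undefined ? SIMPLE_ESCAPES[simple] : undefined;
      // Unknown escapes are left untouched in this sketch.
      return mapped ?? match;
    }
  );
}

const withEscape = doubleQuotedValue('"foo \\n"'); // "foo " followed by a real newline
const withoutEscape = doubleQuotedValue('"*"');    // "*"
```

Because this works on the node's text rather than its children, it behaves the same whether or not the scalar contains any escape_sequence nodes.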

jgomezb11 commented 1 year ago

Oh, that's interesting... I'll see if I can implement something similar. Thank you for replying to my issue. I hope a maintainer will look into this someday as well.

char0n commented 1 year ago

Np. Just think of it as if double_quote_scalar had no children and escape_sequence didn't exist. I use this implementation of unraw in JavaScript: https://www.npmjs.com/package/unraw

I'm sure there are equivalent tools for other languages in their standard or vendor libraries.

There is actually more that needs to be done to get the value out of a double_quote_scalar. Here is an implementation I wrote some time ago: https://github.com/swagger-api/apidom/blob/main/packages/apidom-ast/src/yaml/schemas/canonical-format.ts#L142