neogeny / TatSu

竜 TatSu generates Python parsers from grammars in a variation of EBNF
https://tatsu.readthedocs.io/

Interpreting parseinfo #292

Closed bbbales2 closed 1 year ago

bbbales2 commented 1 year ago

I'm having trouble understanding the values in parseinfo. My goal is to report errors by underlining the relevant parsed text, using the parseinfo attached to each node that a NodeWalker passes to me.

The calc example shows some odd results once I add names to the AST, set the @@parseinfo :: True directive, and make the input multiline.

New grammar:

@@grammar::CALC
@@parseinfo :: True

start
    =
    expression $
    ;

expression
    =
    | left:expression '+' ~ right:term
    | left:expression '-' ~ right:term
    | term
    ;

term
    =
    | left:term '*' ~ right:factor
    | left:term '/' ~ right:factor
    | factor
    ;

factor
    =
    | '(' ~ expression:expression ')'
    | number
    ;

number
    =
    /\d+/
    ;

New test program:

import json
from pprint import pprint

import tatsu

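def simple_parse():
    # Load the modified calc grammar shown above.
    with open('calc_cut.ebnf') as f:
        grammar = f.read()

    parser = tatsu.compile(grammar)
    # Parse a three-line input so that line numbers matter.
    ast = parser.parse("""3 +
5 *
( 10 - 20 )""")

    # Print only the node for the parenthesized factor: ( 10 - 20 )
    print(ast.right.right)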

if __name__ == '__main__':
    simple_parse()

A formatted version of the output looks like:

{
  "expression": {
    "left": "10",
    "right": "20",
    "parseinfo": {
      "tokenizer": null,
      "rule": "expression",
      "pos": 10,
      "endpos": 17,
      "line": 2,
      "endline": 2,
      "alerts": []
    }
  },
  "parseinfo": {
    "tokenizer": null,
    "rule": "factor",
    "pos": 7,
    "endpos": 19,
    "line": 1,
    "endline": 3,
    "alerts": []
  }
}

Given the original text was:

3 +
5 *
( 10 - 20 )

I'm having trouble figuring out how to interpret the lines and positions.

If we look at the outer object (corresponding to ( 10 - 20 )), pos is 7 and line is 1. That's surprising (I would have expected line 3, pos 1), but if we count to the 7th character from line 1 we do get to the first parenthesis, so that kind of works. The endline is 3, which makes sense, but the endpos is 19, which suggests that endpos is counted from line rather than from endline? Also, I think the closing parenthesis is the 17th character, so I don't know where the 19 comes from. Maybe that is consuming extra whitespace and isn't so interesting.

The inner object is more confusing. In this case line and endline are both 2, even though we're talking about 10 - 20 which is on the third line. Am I interpreting these values incorrectly, or is there something strange going on? Thanks for any help!
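One way to check these numbers without counting by eye is to slice the input string with them. The sketch below assumes that pos and endpos are 0-based, end-exclusive character offsets into the whole input (not offsets within a line); that reading is inferred from the values posted in this thread, not taken from the TatSu documentation. It needs only the input text and the posted values, not TatSu itself.

```python
# The input from the report, exactly as passed to parser.parse().
text = "3 +\n5 *\n( 10 - 20 )"

# (rule, pos, endpos) triples copied from the parseinfo output above.
spans = [
    ("expression", 10, 17),
    ("factor", 7, 19),
]

for rule, pos, endpos in spans:
    # Assumption: pos/endpos are 0-based, end-exclusive offsets into `text`.
    print(f"{rule:<10} [{pos}:{endpos}] -> {text[pos:endpos]!r}")
```

Under that assumption the inner span is exactly 10 - 20, and the outer span starts at the newline that ends the second line and runs to the end of the input, whose length is 19. That would also be consistent with the line values being 0-based.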

bbbales2 commented 1 year ago

Oh also I am using Tatsu 5.8.3

bbbales2 commented 1 year ago

Actually I think if I just ignore the line information and use the pos/endpos to index into the string in parseinfo.tokenizer.text I'm good to go. I can take those positions and work out what line everything is on which is good enough for me.

I'll leave this open cuz the line numbers still seem weird to me? But also I could be wrong! Feel free to close this out at your leisure. I have a way to solve my problem, thanks!
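A minimal sketch of that workaround, assuming pos and endpos are 0-based, end-exclusive offsets into the input text. The helper names offset_to_line_col and underline are hypothetical and are not part of TatSu; the code works on a plain string, so the text could come from parseinfo.tokenizer.text or from wherever the input was read.

```python
def offset_to_line_col(text, offset):
    """Return the 0-based (line, column) of a 0-based character offset."""
    line = text.count("\n", 0, offset)
    # rfind returns -1 when there is no earlier newline, so line_start is 0.
    line_start = text.rfind("\n", 0, offset) + 1
    return line, offset - line_start


def underline(text, pos, endpos):
    """Render each source line touched by [pos, endpos) with carets under it."""
    out = []
    line_start = 0
    for line in text.split("\n"):
        line_end = line_start + len(line)
        lo, hi = max(pos, line_start), min(endpos, line_end)
        if lo < hi:
            out.append(line)
            out.append(" " * (lo - line_start) + "^" * (hi - lo))
        line_start = line_end + 1  # step over the newline
    return "\n".join(out)


text = "3 +\n5 *\n( 10 - 20 )"
print(offset_to_line_col(text, 10))
print(underline(text, 10, 17))
```

Deriving the line and column from the offset sidesteps the question of how line and endline are numbered, at the cost of rescanning the text on each call; for large inputs a precomputed list of line-start offsets would be cheaper.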

apalala commented 1 year ago

You could analyze the whole output, and not just a fragment. It would also be good to do it programmatically instead of visually.

In the fragment you posted it's obvious that the second parseinfo refers to the top-level expression, for which the AST is not shown, and it looks correct.

import json

import tatsu
from tatsu.util import asjson

def simple_parse():
    with open('grammars/calc_annotated.ebnf') as f:
        grammar = f.read()

    parser = tatsu.compile(grammar)
    ast = parser.parse(
"""3 +
5 *
( 10 - 20 )""",
        parseinfo=True,
    )

    print(json.dumps(asjson(ast), indent=2))

if __name__ == '__main__':
    simple_parse()

Output:

{
  "left": "3",
  "op": "+",
  "right": {
    "left": "5",
    "op": "*",
    "right": {
      "left": "10",
      "op": "-",
      "right": "20",
      "parseinfo": {
        "tokenizer": null,
        "rule": "expression",
        "pos": 10,
        "endpos": 17,
        "line": 2,
        "endline": 2,
        "alerts": []
      }
    },
    "parseinfo": {
      "tokenizer": null,
      "rule": "term",
      "pos": 4,
      "endpos": 19,
      "line": 1,
      "endline": 3,
      "alerts": []
    }
  },
  "parseinfo": {
    "tokenizer": null,
    "rule": "expression",
    "pos": 0,
    "endpos": 19,
    "line": 0,
    "endline": 3,
    "alerts": []
  }
}
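The same check can be done programmatically over the whole output. The sketch below walks a trimmed copy of the JSON above (only rule, pos, and endpos are kept) and prints the slice of the input that each parseinfo covers. The function iter_parseinfo is a hypothetical helper, not a TatSu API, and the slicing again assumes 0-based, end-exclusive offsets.

```python
text = "3 +\n5 *\n( 10 - 20 )"

# Trimmed copy of the output above: only the fields used here are kept.
ast = {
    "left": "3",
    "op": "+",
    "right": {
        "left": "5",
        "op": "*",
        "right": {
            "left": "10",
            "op": "-",
            "right": "20",
            "parseinfo": {"rule": "expression", "pos": 10, "endpos": 17},
        },
        "parseinfo": {"rule": "term", "pos": 4, "endpos": 19},
    },
    "parseinfo": {"rule": "expression", "pos": 0, "endpos": 19},
}


def iter_parseinfo(node):
    """Yield every parseinfo dict in a nested dict/list, outermost first."""
    if isinstance(node, dict):
        info = node.get("parseinfo")
        if info is not None:
            yield info
        for key, value in node.items():
            if key != "parseinfo":
                yield from iter_parseinfo(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_parseinfo(item)


covered = [(i["rule"], text[i["pos"]:i["endpos"]]) for i in iter_parseinfo(ast)]
for rule, snippet in covered:
    print(f"{rule}: {snippet!r}")
```

Read this way, the top-level expression covers the whole input, the term covers everything from 5 onward, and the inner expression covers 10 - 20.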