lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Can't read `meta` from Tree or Token? #1403

Closed thehappycheese closed 3 months ago

thehappycheese commented 3 months ago

What is your question?

May I please have some help to understand why I can't read meta / metadata from either Tree or Token on my parse result? (This is when manually walking the tree; see the code snippet below.)

My Hypothesis

Code Example

Full grammar:

```python
grammar_text = r"""start: wsc* alternation wsc*

?alternation: inversion (wsc+ alternation_operator wsc+ inversion)*
!alternation_operator: "AND"i | "OR"i

?inversion: (inversion_operator wsc+)* comparison
inversion_operator : "NOT"i

?comparison : expression (wsc? comparison_operator wsc? expression)*
!comparison_operator : "=" | "<>" | "<=" | ">=" | ">" | "<"

?expression : negated_term (wsc* expression_operator wsc* negated_term)*
!expression_operator : "+" | "-"

?negated_term : (negation_operator wsc*)* term
!negation_operator : "-"

?term : exponent (wsc* term_operator wsc* exponent)*
term_operator : ("*" | "/")

?exponent : factor (exponent_operator factor)?
exponent_operator : "**"

?factor : string | SIGNED_NUMBER -> number | parenthesized_factor | function_call | identifier

parenthesized_factor : manual_open_paren wsc* alternation wsc* manual_close_paren
!manual_open_paren : "("
!manual_close_paren : ")"

?function_call : basic_identifier (wsc* function_open_paren wsc* function_parameters? wsc* function_close_paren)?
!function_open_paren : "("
!function_close_paren : ")"
function_parameters : alternation (wsc* function_param_separator wsc* alternation)*
function_param_separator : ","

identifier : table_identifier | basic_identifier
table_identifier : CNAME "->" CNAME
basic_identifier : CNAME

// whitespace or comment
wsc: _whitespace | comment | inline_comment
_whitespace: WS
comment:CPP_COMMENT
inline_comment:C_COMMENT

// Strings
string: SINGLE_QUOTED_STRING | DOUBLE_QUOTED_STRING
_STRING_INNER: /(?:[^"\\]|\\(.|\n))*?/
_SINGLE_QUOTE_INNER: /(?:[^'\\]|\\(.|\n))*?/
DOUBLE_QUOTED_STRING : "\"" _STRING_INNER "\"" // For double-quoted strings.
SINGLE_QUOTED_STRING : "'" _SINGLE_QUOTE_INNER "'" // For single-quoted strings.

%import common.CNAME
%import common.SIGNED_NUMBER
%import common.WS
%import common.WS_INLINE
%import common.C_COMMENT
%import common.CPP_COMMENT"""
```
from lark import Lark, Token, Tree

# load grammar and parse a short example with comments
lgram = Lark(grammar_text)
tree = lgram.parse("""
//  first comment
(basic*expression)**2
// other comment
""")
print(tree.pretty())
start
  wsc   

  wsc
    comment //  first comment
  wsc   

  exponent
    parenthesized_factor
      manual_open_paren (
      term
        basic_identifier    basic
        term_operator
        basic_identifier    expression
      manual_close_paren    )
    exponent_operator
    number  2
  wsc   

  wsc
    comment // other comment
  wsc

Now I understand I am not using the prescribed Transformer or Visitor methods to walk the tree, but I was very excited to use the new Python match/case... is there a reason this won't work?

# try to find top-level comments and print line-number for each
def visit(tree):
    match tree:
        # skip normal whitespace
        case Tree(
            data="wsc",
            children=[str(white_space)]
        ):
            pass
        # try to give line number of comments
        case Tree(
            data="wsc",
            children=[Tree(
                data=Token("RULE", "comment", line=line),
                children=[str(comment)],
                meta=meta
            )]
        ):
            print(f"Comment: line {line} {meta}") ## < ----------------- Problem happens here
        # recurse
        case Tree(data="start", children=children):
            for child in children:
                visit(child)
        # default: just print unhandled nodes 
        case Tree(data=data):
            print(data)

visit(tree)

The output shows an incorrect line number and an empty meta object:

Comment: line 53 <lark.tree.Meta object at 0x0000023523088250>
exponent
Comment: line 53 <lark.tree.Meta object at 0x0000023523920690>

(If I try to read `meta.line`, I get `AttributeError: 'Meta' object has no attribute 'line'`.)

Many thanks <3 awesome project :)

erezsh commented 3 months ago

Well, you have to set `propagate_positions=True` to enable the Tree meta.

thehappycheese commented 3 months ago

Ahh thanks so much @erezsh :) Really appreciate that you took the time. I did not spot that in the docs... I searched for a bunch of things along the lines of "metadata", but I guess that's no substitute for carefully reading the whole thing.

I changed

lgram = Lark(grammar_text, propagate_positions=True)

and

print(f"Comment: line Token({line=}) Tree({meta.line=})")
Comment: line Token(line=53) Tree(meta.line=2)
exponent
Comment: line Token(line=53) Tree(meta.line=4)

Now I get the correct line when reading the metadata from the Tree (`meta.line`), but interestingly the same seemingly incorrect result when reading the Token (`line`). Probably I will need to do some more careful reading 😕

erezsh commented 3 months ago

I'm sure you'll figure it out.

I suggest starting with a minimal example and debugging it. (minimal grammar, minimal input)

If you have it, but still can't figure it out, paste it here and I'll take a look.

erezsh commented 3 months ago

> Many thanks <3 awesome project :)

Thanks :)

thehappycheese commented 3 months ago

Ahhh now I get it... of course, Token(line) is the line on which the rule `comment` was declared in the grammar 🤦‍♂️