lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.62k stars 395 forks source link

Transforming tree after standalone parser results in different AST #1400

Closed pdeibert closed 3 months ago

pdeibert commented 3 months ago

I am playing around with the standalone parser and am getting some strange behavior with the parse trees. Parsing an example with LALR and a generated standalone parser (based on the exact same grammar) results in seemingly identical trees. But when transforming them with the same transformer I get different results. Passing the transformer directly to the standalone parser again gives the correct behavior. Based on the example below it seems that the tree is traversed top-down instead of bottom-up, but I do not understand why or how to control this behavior.

Full example (including console output):

from lark import Lark, Transformer

example_grammar = r'''
program             :   literal DOT

literal             :   ID
                    |   ID PAREN_OPEN PAREN_CLOSE
                    |   ID PAREN_OPEN terms PAREN_CLOSE

terms               :   terms COMMA term
                    |   term

term                :   ID
                    |   VARIABLE

ID                  :   /[a-z]\w*/
VARIABLE            :   /[A-Z]\w*/
DOT                 :   "."
COMMA               :   ","
PAREN_OPEN          :   "("
PAREN_CLOSE         :   ")"

%import common.WS
%ignore WS
'''
# > example_grammar.lark
# python -m lark.tools.standalone -s program example_grammar.lark > example_parser.py

from example_parser import Lark_StandAlone

lalr_parser = Lark(
    example_grammar,
    parser='lalr',
    start="program",
)
standalone_parser = Lark_StandAlone()

# ----- Some AST classes -----
class Program:
    def __init__(self, literal):
        self.literal = literal

    def __str__(self):
        return f"Program({str(self.literal)})"

class Literal:
    def __init__(self, identifier, terms=None):
        self.id = identifier
        self.terms = terms

    def __str__(self):
        return f"Literal({str(self.id)}:{','.join([str(term) for term in self.terms])})"

class Identifier:
    def __init__(self, symbol):
        self.symbol = symbol

    def __str__(self):
        return f"Identifier({self.symbol})"

class Variable:
    def __init__(self, symbol):
        self.symbol = symbol

    def __str__(self):
        return f"Variable({self.symbol})"

# ----- Transformer -----
class ExTransformer(Transformer):
    def program(self, args):
        print("Transforming 'program'")
        return Program(args[0])

    def literal(self, args):
        print("Transforming 'literal'")
        identifier = args[0]
        terms = args[2] if len(args) > 3 else None

        return Literal(identifier, terms)

    def terms(self, args):
        print("Transforming 'terms'")
        terms = [args[-1]]

        if len(args) > 1:
            terms = args[0] + terms

        return terms

    def term(self, args):
        print("Transforming 'term'")
        if args[0].type == "ID":
            return Identifier(args[0].value)
        else:
            return Variable(args[0].value)

transformer = ExTransformer()

# example input
test_prog = "p(X)."

# parsing
tree_lalr = lalr_parser.parse(test_prog)
tree_standalone = standalone_parser.parse(test_prog)

# transforming
print(tree_lalr.pretty())
print()
"""
program
  literal
    p
    (
    terms
      term X
    )
  .
"""

print(tree_standalone.pretty())
print()
"""
program
  literal
    p
    (
    terms
      term X
    )
  .
"""

print(tree_lalr == tree_standalone)
print()
"""
True
"""

print(transformer.transform(tree_lalr))
print()
"""
Transforming 'term'
Transforming 'terms'
Transforming 'literal'
Transforming 'program'
Program(Literal(p:Variable(X)))
"""

print(transformer.transform(tree_standalone))
print()
"""
Transforming 'program'
Program(Tree(Token('RULE', 'literal'), [Token('ID', 'p'), Token('PAREN_OPEN', '('), Tree(Token('RULE', 'terms'), [Tree(Token('RULE', 'term'), [Token('VARIABLE', 'X')])]), Token('PAREN_CLOSE', ')')]))
"""

print(Lark_StandAlone(transformer=transformer).parse(test_prog))
"""
Transforming 'term'
Transforming 'terms'
Transforming 'literal'
Transforming 'program'
Program(Literal(p:Variable(X)))
"""

I am using Lark 1.1.9 and Python 3.11.3

MegaIng commented 3 months ago

You can't mix standalone code and the normal lark distrubtion. lark.Tree and example_parser.Tree are different classes, and lark.Transformer is not going to recognize example_parser.Tree. Use example_parser.Transformer for the standalone tree.

pdeibert commented 3 months ago

I was unaware of this, but it makes perfect sense.

Does this also include the transformer directly passed to the standalone parser upon creation? In the example the output seems to be correct, but I was wondering if I should also use example_parser.Transformer here.

MegaIng commented 3 months ago

You should. In fact, you should never use both a generated standalone parser and the distribution in any situation.

The directly passed transformer happens to work -- in fact, it doesn't need to be a subclass of Transformer at all, only it's methods are extracted and used later on. But you shouldn't rely on this.

pdeibert commented 3 months ago

Thank you for clarifying!