lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.75k stars 401 forks source link

How do I make a recursive binary parse tree using lalr? #1296

Closed RevanthRameshkumar closed 1 year ago

RevanthRameshkumar commented 1 year ago

What is your question?

I understand how the python operator parsing works using the * strategy, but this makes an operator node have any number of children. I want to make a tree where each operator only has a left and right child...but because of the lalr collision issue, I cannot. Is there a strategy for this? The reason I cannot use earley parser is because it won't work when using the Indenter for postlex (due to some parsing issue with this grammar).

grammar:

%import common.CNAME -> CNAME
%import common.WORD
%import common.WS_INLINE

%declare _INDENT _DEDENT
%ignore WS_INLINE

_NL: /(\r?\n[\t ]*)+/

start: expr

variable: /z\d+/
predicate: CNAME "(" variable  ")"

?expr: or_expr
?or_expr: and_expr ("|" and_expr)?
?and_expr: atom ("&" atom)?
?atom: CNAME | "(" expr ")" | predicate

string

a & b & c

expected parse:

▶ start
  ▶ and_expr
  a
    ▶ and_expr
    b
    c

If you're having trouble with your code or grammar

Provide a small script that encapsulates your issue.

Explain what you're trying to do, and what is obstructing your progress.

RevanthRameshkumar commented 1 year ago

Ok, this actually works with lalr, but IDK why:

%import common.CNAME -> CNAME
%import common.WORD
%import common.WS_INLINE

%declare _INDENT _DEDENT
%ignore WS_INLINE

_NL: /(\r?\n[\t ]*)+/

start: bool_e

variable: /z\d+/
predicate: CNAME "(" variable  ")"

bool_e: or_e | and_e
or_e: left "|" right
and_e: left "&" right
left: CNAME
right: bool_e | CNAME

I was expecting it to fail due to bool_e being referenced in right

RevanthRameshkumar commented 1 year ago

oh nvm, the reason that strategy might be bad is due to operator precedence: this would produce an incorrect parse

bool_e: or_e | xor_e | and_e
or_e: left "|" right
xor_e: left "XOR" right
and_e: left "&" right
left: CNAME
right: bool_e | CNAME
a | b & c & e XOR d
RevanthRameshkumar commented 1 year ago

Update, I ended up just using the larl parser and sticking to the * decoding strategy where I get a list of all vars in the expression...so a & b & c will return and_expr(a, b, c), and I just use a transformer to make this into a binary tree via

@dataclass
class LCExpression:
  op: str
  left: Any
  right: Any

class MyTransformer(Transformer):
    def _process_LC_expression(self, op, items):
        if len(items) < 2:
          raise ValueError
        elif len(items) == 2:
          right_items = items[1]
        elif len(items) > 2:
          right_items = self._process_LC_expression(op, items[1:])
        return LCExpression(op= op, left=items[0], right=right_items)

    def or_expr(self, items):
        return self._process_LC_expression('OR', items)

    ...
erezsh commented 1 year ago

Glad you figured it out!