lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Trouble working with the contextual lexer #1249

Closed: wstevick closed this issue 1 year ago

wstevick commented 1 year ago

I'm working on a toy language, and I'm using Lark to write a parser for it. I want to be able to use newlines to separate statements, but also to split long statements across multiple lines. Here's my minimal grammar:

start: function_def*

function_def: WORD block

block: "{" (statement STATEMENT_SEP)* "}"

?statement: WORD (WORD | block)

WORD: /\w+/
STATEMENT_SEP.1: /\r?\n/+

%import common.WS
%ignore WS
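
For reference, this is roughly how I build and run the parser (a sketch; `grammar` and `text` stand for the grammar above and the test input below):

from lark import Lark

# Sketch: LALR with the contextual lexer ("contextual" is already the default
# lexer for parser="lalr"; it's spelled out here only for clarity).
# `grammar` and `text` hold the grammar and input shown in this post.
parser = Lark(grammar, parser="lalr", lexer="contextual")
print(parser.parse(text))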

This is with the LALR parser and the contextual lexer. My thought is that because I've given STATEMENT_SEP a higher priority than WS, the lexer will try to match newlines as STATEMENT_SEP first. But because it's the contextual lexer, a newline in the middle of a statement (in this case, something like "a \n b") will be matched as WS instead. Here's my test input:

thing {
a b
c
    d
e {}
}

otherthing
    {}

When I try to parse it, though, I get this error message.

lark.exceptions.UnexpectedToken: Unexpected token Token('STATEMENT_SEP', '\n\n') at line 6, column 2.
Expected one of:
        * WORD
        * $END

I'm assuming this is a problem with my code. What am I doing wrong?

erezsh commented 1 year ago

The LALR parser just isn't that good at disambiguating. Once it knows that STATEMENT_SEP could follow a block, it will always check for it, even if in this specific context it can't appear (it doesn't know that, because it's a nested context).

The solution is to restructure your grammar so that for every rule that STATEMENT_SEP can follow, it always follows it.

Here is an example solution for your toy example; I can't say how well it will fit into the full language:

from lark import Lark

grammar = r"""
start: function_def*

function_def: WORD block

block: "{" statement* "}"
word_block: "{" statement* "}"
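// word_block duplicates block so that a block used inside a statement is always
// followed by STATEMENT_SEP, while a function-body block never is. This keeps
// STATEMENT_SEP out of block's follow set.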

?statement: (WORD WORD | WORD word_block) STATEMENT_SEP

WORD: /\w+/
STATEMENT_SEP.1: /\r?\n/+

%import common.WS
%ignore WS
"""

text = r"""
thing {
a b
c
    d
e {}
}

otherthing
    {}
"""

parser = Lark(grammar, parser="lalr")
print(parser.parse(text))

Or just use Earley, which should be able to handle it without any additional effort on your end.
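
Something like this (a rough sketch; `grammar` and `text` are your original grammar and input, unchanged):

from lark import Lark

# Sketch: same grammar and input as in the original post, but parsed with
# Earley instead of LALR; no grammar restructuring needed.
earley_parser = Lark(grammar, parser="earley")
print(earley_parser.parse(text))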

wstevick commented 1 year ago

Thanks, that fixed it for me.