lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.88k stars 414 forks source link

Verbatim reproduction of content inside parentheses #1158

Closed ymeiron closed 2 years ago

ymeiron commented 2 years ago

What is your question?

What is the best way to verbatim reproduce content inside parentheses? That content includes more parentheses (guaranteed balanced) and whitespaces. Whitespaces should otherwise be ignored in the grammar.

Provide a small script that encapsulates your issue.

from lark.lark import Lark
from lark.visitors import Interpreter

parser = Lark(r"""
    start: _statement*
    _statement: func_statement | hello_statement
    func_statement: "FUNC(" func_content ")"
    func_content: parens_content ("(" func_content ")")? parens_content
    parens_content: ANYTHING_IN_PARENS*
    ANYTHING_IN_PARENS: /[^\)]/
    hello_statement: "HELLO(" [/\w+/] ")"
    %import common.WS
    %ignore WS
""", start='start')

parse_tree = parser.parse(r'HELLO(start) FUNC(prefix(in(ter)nal)suffix end) HELLO(finish)')

class My_interpreter(Interpreter):
    def func_content(self, tree):
        result = self.visit_children(tree)
        return '(' + ''.join(result) + ')'
    def parens_content(self, tree):
        return ''.join(tree.children)
    def hello_statement(self, tree):
        return tree.children[0].value

result = My_interpreter().visit(parse_tree)
print(result)

Explain what you're trying to do, and what is obstructing your progress.

For the example above, I want to obtain the string prefix(in(ter)nal)suffix end; my program almost works, it doesn't capture the space between suffix and end. It looks like %ignore is quite strong and takes precedence. I can get it to work if I write the whitespaces into the rules, but that makes the rest of the grammar (outside the minimal example) quite difficult. Another not-so-elegant option is to preprocess the input before feeding it to Lark. Any takes on this?

erezsh commented 2 years ago

Balancing parenthesis requires a parser, this is something a normal lexer can't do.

When using LALR, %ignore respects priority:

>>> l = Lark('''start: "a" WS "b"
    WS.10: " "
    %ignore " "
''', parser='lalr')
>>> l.parse('a b')
Tree('start', [Token('WS', ' ')])

When using Earley, priority isn't even required:

>>> l = Lark('''start: "a" WS "b"
     WS: " "
     %ignore " "
''')
>>> l.parse('a b')
Tree('start', [Token('WS', ' ')])

So I think it's worth trying to just parse it using the Lark parser, i.e. by specifying the structure in the grammar.

But if that doesn't work, the alternative is to implement your own lexer, that will handle this special case, and yield to the regular Lark parser for the rest.

ymeiron commented 2 years ago

Thanks! Changing to LALR and setting the WS priority to negative one, did the trick. My grammar is now:

start: _statement*
_statement: func_statement | hello_statement
func_statement: "FUNC(" func_content ")"
func_content: parens_content ("(" func_content ")")? parens_content
parens_content: ANYTHING_IN_PARENS*
ANYTHING_IN_PARENS: /[^\)]/ | "()"
hello_statement: "HELLO(" [/\w+/] ")"
WS.-1: /[ \t\f\r\n]/+
%ignore WS

(I also realized since this morning that empty parentheses caused an error, but adjusting ANYTHING_IN_PARENS as above fixed that issue as well)