lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.77k stars 403 forks source link

Porting from pyparsing match_previous_literal #1437

Open kevemueller opened 2 months ago

kevemueller commented 2 months ago

What is your question? I am porting from pyparsing to Lark due to expectations on increased performance. Initial test show very promising.

One of the few constructs that I could not identify how to express in Lark is match_previous_literal. It allows to dynamically match based on a previously matched literal.

# pyparsing example
first = Word(nums)
match_expr = first + ":" + match_previous_literal(first)

(see pyparsing-docs.readthedocs.io

I need this functionality to match a sed search and replace like construct. ${var:S/from/to/g} including all of its equivalents, like e.g. ${var:S#from#to#g} I would like to match the character after the signaling token :S and use that as the delimiter for the expression. The delimited content must be parsed as well, i.e. may contain constructs like ${var2}.

If you're having trouble with your code or grammar

Currently I am using a workaround using templates simply listing some commonly used separators, but this is not exhaustive and not generic. Almost any character can be used as the separator

?sc_template{x, sep}: sep x sep x sep /[1gW]+/?
expansion_modifier_sc: /[SC]/ (sc_template{token, "/"} | sc_template{token, "%"})

I wonder if there is any possibility to define this generically in the grammar. I could live with an interim solution that retrieves the delimited content and run a parse on it in a post-processing step.

erezsh commented 2 months ago

LALR only parses context-free grammars. However, a function like match_previous_literal() is context-sensitive. So the parser can't help you there.

But a regular expression should be capable of matching this construct.

It sounds like the best solution is to create a regexp terminal that matches the entire expression, and then in post-processing run the regexp again to parse the string. (using groups)

kevemueller commented 2 months ago

Hi Erez,

thanks for pointing me into the right direction with RE.

I confirm that it is as easy as

expansion_modifier_sc: /[SC]/ /(.).*?\1.*?\1[1gW]*/

to back-reference in the RE the matching first character. The complete expression is handed to the post-processor which will then still need to take it apart using Python code. To complete the exercise I will still need to add escaping to the RE, but that is beyond Lark.

Thx for this nice tool.

kevemueller commented 2 months ago

Update: This works with earley/dynamic, but not with larl/contextual. larl/contextual adds additional groups which make calculating the backreference impossible (sometimes \1, sometimes \2).

The error message is re.error: cannot refer to an open group at position

The workaround is to use named capturing group which results in a grammar of

expansion_modifier_sc: /[SC]/ /(?P<sep>.).*?(?P=sep).*?(?P=sep)[1gW]*/

This is agnostic to the lexer adding additional groups.

erezsh commented 2 months ago

That makes sense. Maybe there's a way we could make it work, by fixing indexes for example.

Although requiring names for back-references almost sounds like a win :)