lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.64k stars 397 forks source link

Referencing capturing groups errors when using LALR #1352

Closed mx781 closed 9 months ago

mx781 commented 9 months ago

Are capturing groups not supported when using LALR?

The following works:

Lark(r"start: /[bd]([aeiou])\1[bd]/").parse("deed")

while this does not:

Lark(r"start: /[bd]([aeiou])\1[bd]/", parser="lalr").parse("deed")

and throws

[more stacktrace omitted]

File /usr/lib/python3.11/re/_parser.py:433, in _escape(source, escape, state)
    431 if group < state.groups:
    432     if not state.checkgroup(group):
--> 433         raise source.error("cannot refer to an open group",
    434                            len(escape))
    435     state.checklookbehindgroup(group, source)
    436     return GROUPREF, group

error: cannot refer to an open group at position 26

If so, is there any documentation on what regex features are supported by which parser?

Thanks for the great library!

MegaIng commented 9 months ago

No, you can't use numbered refrence groups since all regex get put together into one regex, which messes with the numbering. named capture groups should however work (if the name doesn't collied with the names used internally for Terminals).

However, I would strongly recommend not using capture groups (or more specifically, backrefrences). They make the Terminals non-regular (and sometimes even context sensitive), which potentially messes with the parser in that sense that it does stuff you don't expect. Almost always you are better off just using the CFG level features, i.e. the actual parser and grammar syntax.

mx781 commented 9 months ago

Got it, thanks for the clarification! I thought of accomplishing this via just grammar, but couldn't find a way to encapsulate repeated tokens that way either. The repetition feature, i.e.

consonant: "q" | "x"
vowel: "u" | "o"
word: consonant vowel ~ 2..4 consonant

is in the right direction, but doesn't require the vowel tokens to be identical, so quox is matches as well quux. Is there a grammar feature I've missed to accomplish something like this?

erezsh commented 9 months ago

but doesn't require the vowel tokens to be identical

That would be context sensitive matching, which Lark doesn't support.

The only way to do it is using regexps.

Perhaps a context-free parser isn't the best tool for your task?

But if you do need a parser, you have the option to give Lark your own custom lexer.

mx781 commented 9 months ago

got it - that clarifies things. thanks for the quick responses!

erezsh commented 9 months ago

On a second thought, if you really are working with letters, which is a fixed amount of tokens, you could do something like

vowel_2to4: "a"~2..4 
            | "e"~2..4 
            | "i"~2..4  
            | "o"~2..4 
            | "u"~2..4 

That should work, but as you can see it's a hack rather than a general purpose solution.