Extra optional symbol affects the choice between two regexps

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Extra optional symbol affects the choice between two regexps #1156

Open ObjatieGroba opened 2 years ago

ObjatieGroba commented 2 years ago

Should an extra optional symbol affect the choice between two regexps?

Let's look at an example:

import lark

parser = lark.Lark('''
?start: s
%ignore /[\\n]/+

R1: /[\\w\\+]/+
R2: /(\\w)/+

SPACE: " "

space: (SPACE+)?

s: ("$" space R1) | ("@" space R2)
''', parser='lalr')

for example in ('$x',
                '@x',
                '$ x',
                '@ y'):
    print(example)
    print(parser.parse(example).pretty())

Both $ and @ parse correctly when there is no space before R1 or R2.

The 4th example raises an exception: Unexpected token Token('R1', 'y') at line 1, column 3. Expected one of: * R2.

Is this a bug (no exception is raised when the grammar is compiled) or a feature? If it is a feature, how can I work around it?

ObjatieGroba commented 2 years ago

Replacing the space with any other symbol (for example ".") leads to the same result.

ObjatieGroba commented 2 years ago

As I understand it, the feature described in the docs can only look at the previous token, is that right?

That's why right after $ or @ it has only one choice, whereas after a space both regexps are possible.

MegaIng commented 2 years ago

This can't really be avoided. The contextual lexer doesn't quite do what you want; it's still limited by the LALR parser. If you want to do this kind of thing, either make sure that R1 and R2 don't conflict, or use parser='earley'.

ObjatieGroba commented 2 years ago

Thank you, @MegaIng

This is not quite intuitive (combining these parsing cases into one pool of tokens).

How can I be sure that no two of my regexps meet at the same parsing point (including recursive cases)?

It would be good for Lark to have a flag that warns about any overlapping regexps.
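
In the absence of such a flag, a rough check can be done outside Lark with only the standard re module. The sketch below is a heuristic, not a proof: it only reports pairs of terminals that both fully match one of the sample strings you give it, and it knows nothing about parser states. The helper find_overlaps and the sample list are hypothetical, and the regexps are the ones from the grammar above with the repetition folded in.

```python
import re
from itertools import combinations


def find_overlaps(terminals, samples):
    """Return {(name_a, name_b): [samples fully matched by both regexps]}."""
    compiled = {name: re.compile(rx) for name, rx in terminals.items()}
    overlaps = {}
    for a, b in combinations(compiled, 2):
        shared = [s for s in samples
                  if compiled[a].fullmatch(s) and compiled[b].fullmatch(s)]
        if shared:
            overlaps[(a, b)] = shared
    return overlaps


# Terminals from the grammar in this issue, as plain Python regexps
terminals = {"R1": r"[\w\+]+", "R2": r"(\w)+", "SPACE": r" "}
# Hand-picked sample inputs (an assumption; extend these for a real grammar)
samples = ["x", "y", "a+b", " ", "+"]

overlaps = find_overlaps(terminals, samples)
print(overlaps)
```

A pair that shows up here is only a problem if both terminals can be expected in the same LALR state, so treat the output as a list of candidates to inspect by hand.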

For example, for s: ("$" R1) | ("@" "$" R2), LALR separates the regexps successfully.