Discrimination between "a" and "aa".

lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

MIT License

4.87k stars 413 forks source link

Discrimination between "a" and "aa". #1061

Closed tgdcat closed 4 months ago

tgdcat commented 2 years ago

Thank you for creating Lark.

I want to use Lark to distinguish between "a" and "aa". (I'm sorry if it's an already mentioned issue.)

Run the following sample program.

from lark import Lark

text = 'a aa aaa'

parser = Lark('''
    start:  statement*
    statement: a | aa | aaa
    a: "a"
    aa: "aa"
    aaa: "aaa"

    %import common.WS
    %ignore WS
''', start='start')

tree = parser.parse(text)

The result I want is:

start
  statement
    a
  statement
    aa
  statement
    aaa

But every time I actually do it, the result is different.


# First time
start
  statement
    a
  statement
    a
  statement
    a
  statement
    aa
  statement
    a

# Second time
start
  statement
    a
  statement
    a
  statement
    a
  statement
    aaa

# Third time
start
  statement
    a
  statement
    aa
  statement
    a
  statement
    aa

Python 3.8.7 lark-parser 0.12.0

My approach may be wrong. I would appreciate it if you could tell me the solution.

erezsh commented 2 years ago

It's a known issue with the Earley parser, that it isn't always deterministic in choosing a derivation when there's ambiguity.

A straight-forward solution would be do give explicit priority to the longer rules:

    a: "a"
    aa.1: "aa"
    aaa.2: "aaa"

Another way is to force to lexer to expect a whitespace after the string:

    a: /a(?=\s)/
    aa: /aa(?=\s)/
    aaa: /aaa/

(%ignore doesn't force a separation in Earley by design)

Another "fix" would be using lexer="basic", since it removes the ambiguity. You can also do parser="lalr".

Having said all this, I do think ideally your use case should work as-is without any of these fixes. So feel free to leave this open, and maybe we'll get to it some day.

tgdcat commented 2 years ago

Thank you for the quick response.

By prioritizing as you say, I was able to get the results I expected. I am grateful for your support.

Having said all this, I do think ideally your use case should work as-is without any of these fixes. So feel free to leave this open, and maybe we'll get to it some day.

Yes. Keep it open until you can resolve the ambiguity without adding anything.

erezsh commented 4 months ago

Now returns a consistent output:

start
  statement
    a
  statement
    a
  statement
    a
  statement
    a
  statement
    a
  statement
    a

🎉