lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.9k stars 415 forks source link

Earley parser produces non-deterministic results even with explicit terminal weights #201

Closed shawnanastasio closed 5 years ago

shawnanastasio commented 6 years ago

Perhaps related to #191

When using the earley parser with weighted terminals, the result is still non-deterministic in certain cases.

Example:

from lark import Lark

grammar = """
root: A | B

A.1: ("0".."9")+
B.2: ("0".."9")+

%import common.WS
%ignore WS
"""

parser = Lark(grammar, start="root", parser="earley")
print(parser.parse("10"))

It is my understanding that using weights was a workaround for this non-determinism (as indicated in #167), but this does not seem to be the case.

I am using the latest lark version available on pip (0.6.2) on Fedora Linux 27 x86_64.

erezsh commented 6 years ago

Earley currently does not support weights on terminals. But it should work as you expect it to with weights on rules. Such as:

root: a | b 

a.1: ("0".."9")+
b.2: ("0".."9")+

This is probably not the best behavior. Hopefully I'll get the chance to fix it soon.

shawnanastasio commented 6 years ago

I see. Perhaps you could help me work around this issue with a project I'm working on.

In the project I have terminals similar to the following for unquoted strings, decimal numbers, and hexadecimal numbers:

root: STR | DEC | HEX

STR: /\S/+
DEC: ("0".."9")+
HEX: ("0x" ("0".."9" | "A".."F" | "a".."f")+)

I'd like to be able to have things like "10" always parse as a decimal and "0xFF" as a hexadecimal, but the non-deterministic behavior is causing them to sometimes be interpreted as strings.

Is there an easy way I can refactor this to work around the unexpected behavior?

erezsh commented 6 years ago

Of course. Just add intermediate rules.

root: STR | DEC | hex

hex.2: HEX

etc.

Hope this helps!

shawnanastasio commented 6 years ago

Thanks for the advice. It seems to work in some cases but not in others.

For example, with the grammar and text below, the hexadecimal literal seems to be interpreted as a string unless I remove the string terminal entirely, in which case it is interpreted as a hexadecimal.

from lark import Lark

grammar = """
root: "BEGIN" statement+ "END"
statement: assignment

assignment: IDENTIFIER "=" rvalue
?rvalue: decimal
      | hexadecimal
      | STRING

IDENTIFIER: ("A".."z" | "0".."9")+

// basic types
STRING: /\S/+
DECIMAL: ("0".."9")+
HEXADECIMAL: ("0x" ("0".."9" | "A".."F" | "a".."f")+)

// wrappers for ambiguous types above
decimal.2: DECIMAL
hexadecimal.2: HEXADECIMAL

%import common.WS
%ignore WS
"""

text = """
BEGIN
    MY_VAR = 0xDEADBEEF
END
"""

parser = Lark(grammar, start="root", parser="earley")
print(parser.parse(text))

I'm not sure if this is an issue with my grammar or with the parser. The lalr parser is able to produce the correct behavior every time, but it is unfortunately unsuitable for the project I'm currently working on.

Any assistance would be greatly appreciated.

erezsh commented 6 years ago

Hmm yeah, this is an actual bug with the current implementation.

I'm currently working on a new earley implementation that should solve this problem. You can check out the branch earley_sppf and try it for yourself. It's still under work, but it's fairly stable. Let me know how it works out for you!

shawnanastasio commented 6 years ago

I was able to switch over to lalr for the time being. Thank you very much for your help.

If you feel that this issue falls under the scope of #191, please feel free to close it.

erezsh commented 5 years ago

0.7 published