lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.75k stars 401 forks source link

Earley parser produces wrong parse Tree #1283

Closed MegaIng closed 2 months ago

MegaIng commented 1 year ago

Over on SO someone asked this question: https://stackoverflow.com/questions/76366280/parsing-formulas-using-lark-ebnf/76381256

As far as I can tell, it shows a bug in the earley parser where it duplicates some Tokens because of enormous amounts of ambiguities. I don't have the expertise with earley to figure out what is happening or the time to create a minimal example right now.

chanicpanic commented 1 year ago

I have it down to:

grammar = """
start: (a+)*
!a.1: "A" |
"""

l = Lark(grammar)
tree = l.parse("A")
print(tree.pretty())

Output:

start
  a
  a     A
  a
  a     A

I suspect that the issue is somewhere in ForestToParseTree with resolve_ambiguity=True. I do not see the same problem in the tree returned when using ambiguity="explicit". I'll look into it more when I have a chance.

chanicpanic commented 1 year ago

The issue is that when ambiguity="resolve", the ParseTreeBuilder will use the ChildFilterLALR filters. However, the ForestToParseTree cache breaks the ChildFilterLALR assumption that it is safe to change the parse tree. We have two options:

  1. Disable the ForestToParseTree cache when ambiguity="resolve".
  2. Always use the regular ChildFilter for earley.

Either will work, but they may have different performance implications.

erezsh commented 1 year ago

@chanicpanic Thanks for looking into it!

Can you explain more about the performance implications?

chanicpanic commented 1 year ago

The ForestToParseTree cache and ChildFilterLALR filter are both optimizations that happen to be incompatible with each other. So, the question is: which optimization gives the most performance benefit for earley with ambiguity resolution?

While the ForestToParseTree cache provides a significant speed-up for explicit ambiguous parses, its impact on ambiguity resolution is likely minimal. My hypothesis is that the cache is only relevant when a rule matches an empty string, and even then, the forest nodes have to be visited in a particular order.

On the other hand, I believe that child filtering is an operation that is performed for many grammars and inputs. So, the ChildFilterLALR would be used much more often.

Thus, I am leaning toward option 1 because I expect the ChildFilterLALR to have a greater overall performance benefit.

erezsh commented 1 year ago

@chanicpanic Thanks for the explanation.

I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be?

I could run benchmarks on trick-grammars like start: start start | "(" start ")" |, and normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?

cc: @MegaIng

Shildifreak commented 1 year ago

I stumbled across what I think is the same problem. Here is the minimal example from my grammar:

GRAMMAR = r'''
start : _s x _s
x : "X"?
_s : " "?
'''

l = Lark(GRAMMAR)

# behave as expected
print(l.parse( " X " ).pretty())
print(l.parse( "X" ).pretty())
print(l.parse( " " ).pretty())

# produces two x nodes
print(l.parse( "" ).pretty())

output:

start
  x

start
  x

start
  x

start
  x
  x
chanicpanic commented 1 year ago

I think it's best to make an evidence-based choice, which makes me wonder, do you have an idea for what a good benchmark grammar&input for Earley would be?

I could run benchmarks on trick-grammars like start: start start | "(" start ")" |, and normal grammars that are low on ambiguity. But maybe you know some real-world examples that are ambiguous and would serve as a better measurement?

@erezsh I don't have any real-world grammars that I think would be a better benchmark for Earley. For this issue, I think it would be good to use the grammars in this thread, or variations that incorporate the same patterns, which are known to utilize the ForestToParseTree cache with ambiguity="resolve". Also, it would be good to benchmark with normal grammars that reflect more typical lark usage.

I made branches for the two options that can be used for comparison:

erezsh commented 2 months ago

Hi @chanicpanic and everyone,

Sorry it took me this long to look into this!

I did some benchmarks, and my conclusion is it that - it really doesn't matter.

Both approaches seem to have the same performance in my tests, with less than 5% difference.

So, just choose whatever seems more "correct". I will accept a PR of either approach.