lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.8k stars 409 forks source link

Inconsistent parse results from simple ambiguous grammar #1434

Closed mcarpenter closed 3 months ago

mcarpenter commented 3 months ago

Describe the bug

I have a small grammar that is a starting point to parse regular expressions. When run repeatedly on the same input Lark produces two different parse trees.

There is an ambiguity in my toy grammar because a RE quantifier (*, +, ?) can also be interpreted as a printable_char (ASCII values between space and tilde). However my reading previous issues (#81, #83) indicates inconsistent parse results should be treated as a bug (and it was certainly a surprise).

To Reproduce

Lark 1.1.9, Python 3.10.12.

#!/usr/bin/env python

from lark import Lark

parser = Lark('''
start: factor+

?factor: atom quantifier?

quantifier: "*" | "+" | "?"

?atom: DOT | printable_char

printable_char: /[ -~]/

DOT: "."
''')

regex = r'a.?'
tree = parser.parse(regex)
print(tree.pretty())

Sample results

start
  printable_char    a
  .
  printable_char    ?
start
  printable_char    a
  factor
    .
    quantifier
MegaIng commented 3 months ago

Does this still happen with the latest master? (pip install git+https://github.com/lark-parser/lark)

mcarpenter commented 3 months ago

Yes, it does still happen. (pip list reports 1.2.0, I still get two different results).

I also see discussion https://github.com/lark-parser/lark/discussions/1288 and setting environment variable PYTHONHASHSEED=0 does indeed yield consistent results (the second of those presented above). Similarly increasing the priority to quantifier.2 also fixes it for me.

erezsh commented 3 months ago

@mcarpenter Try the latest master again. (you might need to uninstall lark first)

mcarpenter commented 3 months ago

That got it, thanks.