Closed richardcooper closed 7 years ago
Hi Richard, thank you for bringing this to my attention.
It embarrasses me to admit, but apparently there is a bug in my lalr generator. It's possible that your grammar is not lalr(1)-compatible, but then lark should produce a clear exception. The fact that it didn't, but instead produced wrong results, is troubling. I will have to look deeper into it. I believe the non-determinism will disappear once that bug is resolved.
Regardless, there is also a small error in your grammar: The regular expressions for "var" and "true"/"false" are in collision, which means that variable names like "falsetto" can make the lexer produce undesired results. It's better to avoid regexps if at all possible: Lark can automatically resolve a collision between a regexp and a string, but not between two regexps.
Anyway, until I fix my bug, allow me to suggest that you use the Earley algorithm, which parses your grammar correctly. It still uses a lexer, so you won't feel a big difference in performance (and probably won't notice at all).
Here is your grammar with tiny modifications, working with Earley: (It works on my machine :tm: )
from lark import Lark
parser = Lark("""
?term: "(" term ")"
| "true" -> true
| "false" -> false
| "if" term "then" term "else" term
| var
| "λ" var ":" type "." term -> lambda
| term term
?type: "(" type ")"
| "Bool" -> bool
| type "→" type
?var: VAR
VAR: ("a".."z")+ "'"*
%import common.WS
%ignore WS
""", start='term', lexer='standard', parser='earley', ambiguity='explicit')
print(parser.parse("(λa:(Bool→Bool). a) (λb:Bool. b)").pretty())
I hope this helps. I will keep you updated regarding the bug in the lalr parser generator.
Hi again, I did not yet address the issue of lark's misreporting of grammar errors, however --
I did fix the bug which made some branches disappear. Apparently there was a subtle bug that you activated with the "term: term term" production. Well, it's fixed now, and I get consistent and seemingly correct results running the modified grammar with LALR.
So, replace the last line with:
""", start='term', lexer='standard', parser='lalr')
And let me know if works, or if there are other issues hindering your progress.
(I'm keeping the issue open, since the error reports are still problematic.)
Using the 'lalr ' parser in the latest code does seem to have fixed both the vanishing branch bug and the non-determinism when I use the modified grammar. Thank you!
When I use my original grammar (with all the regexes) the vanishing branch bug is fixed but the non-determinism is still present (As expected)
I had originally used regexes because I want to keep those terminals. I'll open a new issue about that.
I'm considering this issue as fixed. The only thing left is to improve the error messages.
Running the following lambda calculus parser on Python 3.6 exhibits non-determistic behaviour. That is, the exact same code run multiple times gives different results.
It seems that the cause of the non-determinism is PYTHONHASHSEED. On my machine PYTHONHASHSEED=3 produces the following result consistently
Whereas PYTHONHASHSEED=1 raises an exception every time.
If I don't specify PYTHONHASHSEED at all then both results happen about 50% of the time.
(As an aside, even when it "works" it's not producing the parse tree I'm expecting. The whole "λa" branch of the parse seems to be missing. That might be user error though.)