lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.92k stars 417 forks source link

Lark failing non-deterministically #9

Closed richardcooper closed 7 years ago

richardcooper commented 7 years ago

Running the following lambda calculus parser on Python 3.6 exhibits non-determistic behaviour. That is, the exact same code run multiple times gives different results.

from lark import Lark

parser = Lark("""
    ?term: "(" term ")"
         | /true/
         | /false/
         | /if/ term /then/ term /else/ term
         | var
         | /λ/ var /:/ type /./ term
         | term term

    ?type: "(" type ")"
         | /Bool/
         | type /→/ type

    ?var: /[a-z]+'*/

    %import common.WS
    %ignore WS
""", lexer='contextual', start='term', parser='lalr')

print(parser.parse("(λa:(Bool→Bool). a) (λb:Bool. b)").pretty())

It seems that the cause of the non-determinism is PYTHONHASHSEED. On my machine PYTHONHASHSEED=3 produces the following result consistently

$ PYTHONHASHSEED=3 python lark_error.py 
term
  λ
  b
  :
  Bool
  .
  b

Whereas PYTHONHASHSEED=1 raises an exception every time.

$ PYTHONHASHSEED=1 python lark_error.py 
Traceback (most recent call last):
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 31, in get_action
    return states_idx[state][key]
KeyError: 'ANONRE_7'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "lark_error.py", line 22, in <module>
    print(parser.parse("(λa:(Bool→Bool). a)  (λb:Bool. b)").pretty())
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/lark.py", line 184, in parse
    return self.parser.parse(text)
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parser_frontends.py", line 48, in parse
    return self.parser.parse(tokens, self.lexer.set_parser_state)
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 60, in parse
    action, arg = get_action(token.type)
  File "/Users/richard/Documents/lark_test/venv/lib/python3.6/site-packages/lark/parsers/lalr_parser.py", line 35, in get_action
    raise UnexpectedToken(token, expected, seq, i)
lark.common.UnexpectedToken: Unexpected token Token(ANONRE_7, '→') at line 1, column 9.
Expected: dict_keys(['__RPAR', 'ANONRE_9'])
Context: <no context>

If I don't specify PYTHONHASHSEED at all then both results happen about 50% of the time.

(As an aside, even when it "works" it's not producing the parse tree I'm expecting. The whole "λa" branch of the parse seems to be missing. That might be user error though.)

erezsh commented 7 years ago

Hi Richard, thank you for bringing this to my attention.

It embarrasses me to admit, but apparently there is a bug in my lalr generator. It's possible that your grammar is not lalr(1)-compatible, but then lark should produce a clear exception. The fact that it didn't, but instead produced wrong results, is troubling. I will have to look deeper into it. I believe the non-determinism will disappear once that bug is resolved.

Regardless, there is also a small error in your grammar: The regular expressions for "var" and "true"/"false" are in collision, which means that variable names like "falsetto" can make the lexer produce undesired results. It's better to avoid regexps if at all possible: Lark can automatically resolve a collision between a regexp and a string, but not between two regexps.

Anyway, until I fix my bug, allow me to suggest that you use the Earley algorithm, which parses your grammar correctly. It still uses a lexer, so you won't feel a big difference in performance (and probably won't notice at all).

Here is your grammar with tiny modifications, working with Earley: (It works on my machine :tm: )

from lark import Lark

parser = Lark("""
    ?term: "(" term ")"
         | "true" -> true
         | "false" -> false
         | "if" term "then" term "else" term
         | var
         | "λ" var ":" type "." term -> lambda
         | term term

    ?type: "(" type ")"
         | "Bool" -> bool
         | type "→" type

    ?var: VAR

    VAR: ("a".."z")+ "'"*

    %import common.WS
    %ignore WS
""", start='term', lexer='standard', parser='earley', ambiguity='explicit')

print(parser.parse("(λa:(Bool→Bool). a) (λb:Bool. b)").pretty())

I hope this helps. I will keep you updated regarding the bug in the lalr parser generator.

erezsh commented 7 years ago

Hi again, I did not yet address the issue of lark's misreporting of grammar errors, however --

I did fix the bug which made some branches disappear. Apparently there was a subtle bug that you activated with the "term: term term" production. Well, it's fixed now, and I get consistent and seemingly correct results running the modified grammar with LALR.

So, replace the last line with:

""", start='term', lexer='standard', parser='lalr')        

And let me know if works, or if there are other issues hindering your progress.

(I'm keeping the issue open, since the error reports are still problematic.)

richardcooper commented 7 years ago

Using the 'lalr ' parser in the latest code does seem to have fixed both the vanishing branch bug and the non-determinism when I use the modified grammar. Thank you!

When I use my original grammar (with all the regexes) the vanishing branch bug is fixed but the non-determinism is still present (As expected)

I had originally used regexes because I want to keep those terminals. I'll open a new issue about that.

erezsh commented 7 years ago

I'm considering this issue as fixed. The only thing left is to improve the error messages.