lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Trying to understand non-deterministic behaviour #278

Closed rkujawa closed 5 years ago

rkujawa commented 5 years ago

Hi. First, I have to admit that I am new both to Lark and Python (I came from the land of C). I'm trying to write a simple assembler.

I've prepared the following Lark grammar:

start: source+
source: [LABEL_DEF+] MNEMONIC [[IMM_OPERAND|OPERAND|SYMBOL][","IMM_OPERAND|","OPERAND|","SYMBOL]] [COMMENT] _NL -> instruction
      | META [META_ARG] [COMMENT] _NL -> meta
      | [COMMENT] _NL

LABEL_DEF: SYMBOL":"
SYMBOL: NAME
META_ARG: DEC_NUMBER|HEX_NUMBER
OPERAND: DEC_NUMBER|HEX_NUMBER
IMM_OPERAND: IMM_DEC_NUMBER|IMM_HEX_NUMBER
MNEMONIC: "nop"|"stp"
META: ".org"
DEC_NUMBER: /[1-9]\d*l?/i
IMM_DEC_NUMBER: /#[1-9]\d*l?/i
HEX_NUMBER: /0x[\da-f]*l?/i
IMM_HEX_NUMBER: /#0x[\da-f]*l?/i
COMMENT: ";"/.*/
WHITE: " "|"\t"
%import common.CNAME -> NAME
%import common.LETTER
%import common.INT -> NUMBER
%import common.NEWLINE -> _NL
%ignore WHITE
%ignore COMMENT
%ignore _NL

(disregard the fact that there are only two mnemonics defined for now)

I've also got a test file that looks like:

.org 0xC00

foo:    nop 1
    nop 0x1,1   ; fdas
foo2:
foo3:
    nop #0x1,1
    nop lab1
    stp #2

; comment
    nop     ; fds
; fds

The parser works consistently on this example. However, if I change the first line to be a comment, to look like:

;.org 0xC00

Then approximately half of the time, I'll get the expected parsing output, and the other half of the time I get an empty token at the beginning:

[]
[Token(LABEL_DEF, 'foo:'), Token(MNEMONIC, 'nop'), Token(OPERAND, '1')]
[Token(MNEMONIC, 'nop'), Token(OPERAND, '0x1'), Token(OPERAND, '1')]
[Token(LABEL_DEF, 'foo2:'), Token(LABEL_DEF, 'foo3:'), Token(MNEMONIC, 'nop'), Token(IMM_OPERAND, '#0x1'), Token(OPERAND, '1')]
[Token(MNEMONIC, 'nop'), Token(SYMBOL, 'lab1')]
[Token(MNEMONIC, 'stp'), Token(IMM_OPERAND, '#2')]
[Token(MNEMONIC, 'nop')]

Any hints on how to handle this?
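A quick way to surface this kind of flakiness is to run the same parse repeatedly and collect the distinct results (a sketch; `parse_fn` is a stand-in for something like `lark.Lark(grammar, parser='earley').parse`, not Lark's actual API):

```python
def distinct_results(parse_fn, text, runs=20):
    """Run parse_fn(text) `runs` times and return the set of distinct
    stringified results. A deterministic parser yields exactly one."""
    return {str(parse_fn(text)) for _ in range(runs)}

# Any callable works; a deterministic one always gives a single result:
assert distinct_results(lambda s: s.upper(), "nop #2") == {"NOP #2"}
```

If the returned set has more than one element (or changes between interpreter runs), the parser is non-deterministic for that input.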

federicotdn commented 5 years ago

I've been experiencing a similar problem with an (ambiguous) grammar I've defined. Parsing a fixed string with a fixed grammar yields different results each time I call parse, even when creating a new Lark object each time. Maybe there's some non-deterministic behaviour in how the Earley algorithm has been implemented? I've tried out the 0.7b branch and the problem persists.

erezsh commented 5 years ago

Yes, that's likely. Earley is a tricky algorithm, and I haven't yet had the time to sit down and rewrite it to my satisfaction. I do think 0.7b should be an improvement on the existing implementation, but at the same time I know it has imperfections.

I don't know what to suggest, except that you wait patiently for an update. If you have any ideas, I'll be happy to consider them.

federicotdn commented 5 years ago

Of course, I don't mind waiting. I'll try and see if I can find anything on master or 0.7b that might be causing the non-deterministic behavior. I'm not very familiar with how the Earley algorithm works, so it might take a while.

Thanks for creating this library!

federicotdn commented 5 years ago

@erezsh I noticed that the RulePtr class implements __hash__ by returning the hash of (rule, index), but class Rule does not implement __hash__ (checked on 0.7b).

I added a quick implementation of __hash__ and __eq__ to Rule using repr(self) as the value, and the OP's example has become deterministic. Could this be related to the problem?
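For anyone following along, here is the mechanism in miniature (a toy stand-in, not Lark's actual Rule class): without `__eq__`/`__hash__`, Python falls back to identity-based hashing, so two structurally identical rules look "different" to a set, deduplication fails, and iteration order ends up depending on memory addresses.

```python
class Rule:
    """Toy stand-in: no __eq__/__hash__, so Python hashes by identity."""
    def __init__(self, origin, expansion):
        self.origin = origin
        self.expansion = expansion

class HashableRule(Rule):
    """Same data, but equality and hashing are defined on the fields."""
    def __eq__(self, other):
        return (self.origin, self.expansion) == (other.origin, other.expansion)
    def __hash__(self):
        return hash((self.origin, self.expansion))

# Identity hashing: structurally equal rules survive as duplicates...
assert len({Rule("start", ("source",)), Rule("start", ("source",))}) == 2
# ...while field-based hashing collapses them into one:
assert len({HashableRule("start", ("source",)), HashableRule("start", ("source",))}) == 1
```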

erezsh commented 5 years ago

@federicotdn I think you might be onto it! I will take a deeper look tomorrow.

(Although of course repr(self) is too inefficient, so I'll use a different solution)
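A cheaper alternative to hashing `repr(self)` on every call is to hash a tuple of the identifying fields once at construction time and cache it (just a sketch of the idea, not the fix that actually landed):

```python
class Rule:
    """Sketch: hash the identifying fields directly and cache the result,
    instead of building a repr() string on every __hash__ call."""
    def __init__(self, origin, expansion):
        self.origin = origin
        self.expansion = tuple(expansion)
        self._hash = hash((self.origin, self.expansion))  # computed once

    def __hash__(self):
        return self._hash

    def __eq__(self, other):
        if not isinstance(other, Rule):
            return NotImplemented
        return self.origin == other.origin and self.expansion == other.expansion

# Equal rules must have equal hashes for sets/dicts to behave correctly:
a = Rule("start", ["source"])
b = Rule("start", ["source"])
assert a == b and hash(a) == hash(b)
```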

erezsh commented 5 years ago

@federicotdn I pushed the change, just in case it will fix a few of these cases.

However, the bigger issue is still at large: parsing remains non-deterministic, both on master and on 0.7b.

federicotdn commented 5 years ago

Thanks!

Have you tried experimenting with PYTHONHASHSEED? (docs) I noticed that some parts of the code iterate over sets and dictionaries. Depending on the version of Python being used, this order can change between runs (on top of being unspecified in the first place).

If using PYTHONHASHSEED=0 restores determinism, then one or more parts of the code are likely relying on a specific iteration order of sets/dictionaries, instead of depending only on the specified grammar and input text.
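The effect of the seed is easy to check from the outside: spawn a child interpreter with a pinned PYTHONHASHSEED and observe that set iteration order becomes reproducible across runs (with the default randomized seed it generally is not). A small self-contained sketch:

```python
import os
import subprocess
import sys

def set_order(seed):
    """Print a set of strings from a child interpreter whose PYTHONHASHSEED
    is pinned to `seed`, and return the observed iteration order."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run(
        [sys.executable, "-c", "print(list({'nop', 'stp', 'org', 'foo'}))"],
        env=env, capture_output=True, text=True,
    )
    return result.stdout

# With the same seed, string hashing -- and hence set order -- is
# identical on every run:
assert set_order(0) == set_order(0)
```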

erezsh commented 5 years ago

Thanks @federicotdn , this is a really helpful suggestion! It does seem like a fixed PYTHONHASHSEED improves the stability of Earley's behavior. It appears to be entirely deterministic now when using the standard lexer. However, there are still minor variances when using the dynamic lexer, so the result is still not deterministic (but it definitely seems closer now). I can't imagine why that would be the case...

federicotdn commented 5 years ago

Glad it helped! If parts of the code are still non-deterministic even with a fixed PYTHONHASHSEED, then the problem may be that some class (other than str, bytes or datetime, whose hashes are what the seed controls) is missing its __hash__ method and falling back to the default identity-based hash, and somewhere the code is iterating over sets/dictionaries of instances of that class.

jacobperron commented 5 years ago

I've been working with 0.7b and noticed similar non-determinism introduced by https://github.com/lark-parser/lark/commit/04d90fa9165741333c2e394592a64e1c3966aa14. One commit earlier, there does not appear to be any non-determinism in my use-case. The non-determinism still persists for me with 0.7c.

dirk-thomas commented 5 years ago

@jacobperron I can reproduce the non-deterministic behavior with 0.7b [76e185a36cfba49d01d5f579ecfa7ab619c83463].

With 0.7c [65d3212bed936a0ac3c4a5980aad056c7d0afe60] as well as 0.7d [862a853340c27070991a1a25dcf0385f507eec22] it seems to be correct and deterministic though.

jacobperron commented 5 years ago

@dirk-thomas I can confirm that the latest 0.7c (at 65d3212b) brings back the correct behavior. 0.7d seems good too.

erezsh commented 5 years ago

Yes, we've corrected this bug recently. I'm glad to hear that it seems to be working for others as well.