Closed rkujawa closed 5 years ago
I've been experiencing a similar problem with an (ambiguous) grammar I've defined. Parsing a set string with a set grammar yields different results each time I call parse
, even when creating a new Lark
object each time. Maybe there's some non-deterministic behaviour in how the Earley algorithm has been implemented? I've tried out the 0.7b
branch and the problem persists.
Yes, that's likely. Earley is a tricky algorithm, and I haven't yet had the time to sit down and rewrite it to my satisfaction. I do think 0.7b should be an improvement on the existing implementation, but at the same time I know it has imperfections.
I don't know what to suggest, except that you wait patiently for an update. If you have any idea, I'll be happy to consider them.
Of course, I don't mind waiting. I'll try and see if I can find anything on master or 0.7b that might be causing the non-deterministic behavior. I'm not very familiar on how the Earley algorithm works so it might take a while.
Thanks for creating this library!
@erezsh I noticed that the RulePtr
class implements __hash__
by returning the hash of (rule, index
), but class Rule
does not implement __hash__
(checked on 0.7b).
I added a quick implementation of __hash__
and __eq__
to Rule
using repr(self)
as value and OP's example has become deterministic. Could this be related to the problem?
@federicotdn I think you might be onto it! I will take a deeper look tomorrow.
(Although of course repr(self)
is too inefficient, so I'll use a different solution)
@federicotdn I pushed the change, just in case it will fix a few of these cases.
However, the bigger issue remains at large. It's still non-deterministic, both at master and at 0.7b
Thanks!
Have you tried experimenting with PYTHONHASHSEED
? (docs) I noticed that some parts of the code iterate over sets and dictionaries. Depending on the version of Python being used, this order could change between each run (apart from not having any specific order in the first place).
If using PYTHONHASHSEED=0
solves the determinism, then it means that one or more parts of the code could be relying on a specific iteration order of sets/dictionaries instead of just depending on the specified grammar and text input.
Thanks @federicotdn , this is a really helpful suggestion! It does seem like a fixed PYTHONHASHSEED improves the stability of Earley's behavior. It appears to be entirely deterministic now when using the standard lexer. However, there are still minor variances when using the dynamic lexer, and so the result is still not deterministic (but it definitely seems closer now). I can't imagine why that would be the case..
Glad it helped! If parts of the code are still non-deterministic when using a fixed PYTHONHASHSEED
, then the problem may be that a class (not str
, bytes
or datetime
) is missing its __hash__
function, and somewhere the code is iterating over sets/dictionaries of this class.
I've been working with 0.7b and noticed similar non-determinism introduced by https://github.com/lark-parser/lark/commit/04d90fa9165741333c2e394592a64e1c3966aa14. One commit earlier, there does not appear to be any non-determinism in my use-case. The non-determinism still persists for me with 0.7c.
@jacobperron I can reproduce the non-deterministic behavior with 0.7b
[76e185a36cfba49d01d5f579ecfa7ab619c83463].
With 0.7c
[65d3212bed936a0ac3c4a5980aad056c7d0afe60] as well as 0.7d
[862a853340c27070991a1a25dcf0385f507eec22] it seems to be correct and deterministic though.
@dirk-thomas I can confirm that the latest 0.7c
(at 65d3212b) brings back the correct behavior. 0.7d
seems good too.
Yes, we've corrected this bug recently. I'm glad to hear that it seems to be working for others as well.
Hi. Frist, I have to admit that I am new both to Lark and Python (I came from the land of C). I'm trying to write a simple assembler.
I've prepared the following Lark grammar:
(disregard the fact that there are only two mnemonics defined for now)
I've also got a test file that looks like:
The parser works consistently on this example. However, if I change the first line to be a comment, to look like:
Then approximately half of the time, I'll get the expected parsing output, and the other half of the time I get an empty token at the beginning:
Any hints on how to handle this?