Open bcr opened 1 year ago
Yes, I believe this happens because Lark uses sets, whose iteration order changes randomly based on Python's hash seed.
I don't consider this a bug, per se, but I will accept a PR that fixes it. (if the fix is reasonable)
Alternatively, you can run Lark with a constant PYTHONHASHSEED - https://docs.python.org/3.4/using/cmdline.html#envvar-PYTHONHASHSEED
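The PYTHONHASHSEED workaround can be sanity-checked with a short standalone script (not Lark-specific; the child snippet and helper below are purely illustrative): with the seed pinned, set iteration order is reproducible across interpreter runs.

```python
# Demonstrates that PYTHONHASHSEED makes set iteration order reproducible
# across separate interpreter runs.
import os
import subprocess
import sys

# Child program whose output depends on set ordering of strings.
CHILD = "print(list({'states', 'rules', 'tokens', 'start'}))"

def run_with_seed(seed):
    """Run the child snippet in a fresh interpreter with a fixed hash seed."""
    env = dict(os.environ, PYTHONHASHSEED=str(seed))
    result = subprocess.run([sys.executable, "-c", CHILD],
                            env=env, capture_output=True, text=True)
    return result.stdout

# With the same seed, every run produces the same ordering.
assert run_with_seed(0) == run_with_seed(0)
```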
In my case, I have a script that generates the module and I check the module into git, so I tend to run the generation script periodically to make sure I didn't change the grammar. I could be an adult and only generate the module when the grammar changes of course. Thanks for the PYTHONHASHSEED suggestion though. I think I'll just get over it and do a better job on my end.
I'm trying to help someone integrate Lark-js into a repo, and one of our requirements is that generated files have a CI check that verifies that the generated file is up to date. For lark-js, this is "generate the file, check if there are differences between the generated file and the checked in version". With Lark being non-deterministic, this check is impossible.
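The "is the generated file up to date?" gate can be sketched as a small helper (file names and CI wiring are hypothetical); the determinism problem is exactly that this comparison fails spuriously even when the grammar is unchanged:

```python
# Hypothetical CI check: compare freshly generated output against the
# checked-in copy, printing a unified diff when they differ.
import difflib
import pathlib
import sys

def check_up_to_date(generated: str, checked_in: pathlib.Path) -> bool:
    """Return True if the checked-in file matches the freshly generated
    text; otherwise print a unified diff and return False."""
    current = checked_in.read_text()
    if current == generated:
        return True
    sys.stdout.writelines(difflib.unified_diff(
        current.splitlines(keepends=True),
        generated.splitlines(keepends=True),
        fromfile=str(checked_in),
        tofile="freshly generated",
    ))
    return False
```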
When you tried the PYTHONHASHSEED suggestion what happened?
that eliminates... some of the randomness... it seems like what remains is due to the memoization stuff
just as a PoC, replacing the lark-js generation stuff with this:
```python
import json

from lark.grammar import Rule
from lark.lexer import TerminalDef


def generate_js_standalone(lark_inst):
    """Returns a string containing the Javascript standalone parser,
    for the given Lark instance.
    """
    if lark_inst.options.parser != 'lalr':
        raise NotImplementedError("Lark.js only works with LALR parsers for now")

    data, memo = lark_inst.memo_serialize([TerminalDef, Rule])

    # Sort the memo indices by the JSON of their entries, giving a
    # content-based (and therefore run-independent) numbering.
    remapped_memo = [i for i in range(len(memo))]
    remapped_memo.sort(key=lambda i: json.dumps(memo[i]))

    def walk(data, f):
        """Apply f to every node of a nested list/dict structure."""
        data = f(data)
        if isinstance(data, list):
            return [walk(i, f) for i in data]
        elif isinstance(data, dict):
            return {k: walk(v, f) for k, v in data.items()}
        else:
            return data

    def remap(v):
        # '@' marks a memo reference; rewrite it to the stable numbering.
        if isinstance(v, dict):
            if '@' in v:
                v['@'] = remapped_memo.index(v['@'])
        return v

    data = walk(data, remap)
    memo = {i: memo[remapped_memo[i]] for i in range(len(memo))}

    data_json = json.dumps(data, indent=2)
    memo_json = json.dumps(memo, indent=2)

    # __dir__ here is lark-js's module-level path to its package directory.
    with open(__dir__ / 'lark.js') as lark_js:
        output = lark_js.read()
    output += '\nvar DATA=%s;\n' % data_json
    output += '\nvar MEMO=%s;\n' % memo_json
    return output
```
seems to remove the rest of the randomness, but... that's probably the wrong place to mess with the memoization stuff?
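For what it's worth, `json.dumps(..., sort_keys=True)` alone would not be enough here: it canonicalizes key order inside each object, but not the memo ids themselves, which are handed out in set/dict iteration order. A tiny illustration (toy data, not real Lark output):

```python
import json

# sort_keys canonicalizes *key order* inside each JSON object...
a = json.dumps({"b": 1, "a": 2}, sort_keys=True)
b = json.dumps({"a": 2, "b": 1}, sort_keys=True)
assert a == b

# ...but it cannot fix ids assigned in iteration order: the same memo
# serialized from two different set orders still differs textually.
memo1 = {0: "ruleA", 1: "ruleB"}   # run 1: set iterated ruleA first
memo2 = {0: "ruleB", 1: "ruleA"}   # run 2: set iterated ruleB first
assert json.dumps(memo1, sort_keys=True) != json.dumps(memo2, sort_keys=True)
```

That is why the PoC above renumbers entries by their serialized content before dumping.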
Where exactly does the randomness come from with a fixed PYTHONHASHSEED? Memoization should only be random because of dict/set ordering.
There might still be randomness with a fixed PYTHONHASHSEED if we rely on id() for hashing, which can happen inadvertently, since identity-based hashing is the default behavior for Python objects.
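A small illustration of that failure mode (the `Rule` class below is a hypothetical stand-in, not Lark's): objects without `__eq__`/`__hash__` fall back to identity-based hashing, which is derived from the object's address and therefore varies between runs regardless of PYTHONHASHSEED.

```python
class Rule:  # hypothetical stand-in for a parser-internal object
    def __init__(self, name):
        self.name = name

# Default semantics: two objects with identical content are unequal,
# because equality and hashing are identity-based.
a, b = Rule("x"), Rule("x")
assert a != b

# Content-based __eq__/__hash__ removes the dependence on object
# identity, so hashes are stable across runs (though set iteration
# order still depends on the hash seed):
class StableRule(Rule):
    def __eq__(self, other):
        return isinstance(other, StableRule) and self.name == other.name

    def __hash__(self):
        return hash(self.name)

assert StableRule("x") == StableRule("x")
assert hash(StableRule("x")) == hash(StableRule("x"))
```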
Just ran into this as well, any updates? I'd be happy to open a PR if you have any pointers :)
I don't know if this is still an issue or not, but we introduced an OrderedSet implementation in Lark. So a solution would be to simply use it instead of sets everywhere.
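Since Python 3.7, plain dicts preserve insertion order, so an ordered set is straightforward to build on top of one. Lark ships its own implementation; this is just a minimal sketch of the idea:

```python
class OrderedSet:
    """Set with deterministic, insertion-ordered iteration, backed by a
    dict (dicts preserve insertion order since Python 3.7)."""

    def __init__(self, items=()):
        self._d = dict.fromkeys(items)

    def add(self, item):
        self._d[item] = None

    def __contains__(self, item):
        return item in self._d

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)


s = OrderedSet(["b", "a", "b", "c"])
s.add("a")
assert list(s) == ["b", "a", "c"]  # insertion order kept, duplicates dropped
```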
**Describe the bug**

When generating output using the minimal grammar of

```
start:
```

(as well as more complex grammars, but this reproduces it in my environment) the output varies. On cursory examination running several times, there appear to be two variants that differ by the `states` that are generated.

**To Reproduce**

The diff output on my machine is summarized below.