lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.

Standalone module generation output varies #1194

Open bcr opened 1 year ago

bcr commented 1 year ago

Describe the bug

When generating output using the minimal grammar start: (more complex grammars reproduce this too, but this is the simplest case in my environment), the output varies between runs. On cursory examination over several runs, there appear to be two variants that differ in the states that are generated.

To Reproduce

echo "start:" > test.g
python3 -m lark.tools.standalone test.g > test.py
python3 -m lark.tools.standalone test.g > test1.py
python3 -m lark.tools.standalone test.g > test2.py
python3 -m lark.tools.standalone test.g > test3.py
diff test.py test1.py
diff test.py test2.py
diff test.py test3.py

The diff output on my machine is summarized below.

--- test.py 2022-09-22 23:17:36.000000000 -0700
+++ test1.py    2022-09-22 23:17:39.000000000 -0700
@@ -3091,7 +3091,7 @@

 import pickle, zlib, base64
 DATA = (
-{'parser': {'lexer_conf': {'terminals': [], 'ignore': [], 'g_regex_flags': 0, 'use_bytes': False, 'lexer_type': 'contextual', '__type__': 'LexerConf'}, 'parser_conf': {'rules': [{'@': 0}], 'start': ['start'], 'parser_type': 'lalr', '__type__': 'ParserConf'}, 'parser': {'tokens': {0: 'start', 1: '$END'}, 'states': {0: {0: (0, 1), 1: (1, {'@': 0})}, 1: {}}, 'start_states': {'start': 0}, 'end_states': {'start': 1}}, '__type__': 'ParsingFrontend'}, 'rules': [{'@': 0}], 'options': {'debug': False, 'keep_all_tokens': False, 'tree_class': None, 'cache': False, 'postlex': None, 'parser': 'lalr', 'lexer': 'contextual', 'transformer': None, 'start': ['start'], 'priority': 'normal', 'ambiguity': 'auto', 'regex': False, 'propagate_positions': False, 'lexer_callbacks': {}, 'maybe_placeholders': False, 'edit_terminals': None, 'g_regex_flags': 0, 'use_bytes': False, 'import_paths': [], 'source_path': None, '_plugins': {}}, '__type__': 'Lark'}
+{'parser': {'lexer_conf': {'terminals': [], 'ignore': [], 'g_regex_flags': 0, 'use_bytes': False, 'lexer_type': 'contextual', '__type__': 'LexerConf'}, 'parser_conf': {'rules': [{'@': 0}], 'start': ['start'], 'parser_type': 'lalr', '__type__': 'ParserConf'}, 'parser': {'tokens': {0: 'start', 1: '$END'}, 'states': {0: {}, 1: {0: (0, 0), 1: (1, {'@': 0})}}, 'start_states': {'start': 1}, 'end_states': {'start': 0}}, '__type__': 'ParsingFrontend'}, 'rules': [{'@': 0}], 'options': {'debug': False, 'keep_all_tokens': False, 'tree_class': None, 'cache': False, 'postlex': None, 'parser': 'lalr', 'lexer': 'contextual', 'transformer': None, 'start': ['start'], 'priority': 'normal', 'ambiguity': 'auto', 'regex': False, 'propagate_positions': False, 'lexer_callbacks': {}, 'maybe_placeholders': False, 'edit_terminals': None, 'g_regex_flags': 0, 'use_bytes': False, 'import_paths': [], 'source_path': None, '_plugins': {}}, '__type__': 'Lark'}
 )
 MEMO = (
 {0: {'origin': {'name': Token('RULE', 'start'), '__type__': 'NonTerminal'}, 'expansion': [], 'order': 0, 'alias': None, 'options': {'keep_all_tokens': False, 'expand1': False, 'priority': None, 'template_source': None, 'empty_indices': (), '__type__': 'RuleOptions'}, '__type__': 'Rule'}}

erezsh commented 1 year ago

Yes, I believe this happens because Lark uses sets, whose iteration order changes randomly based on Python's hash seed.

I don't consider this a bug per se, but I will accept a PR that fixes it (if the fix is reasonable).
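
A small illustration of the effect outside of Lark (not part of the original comment): string hashes are salted per interpreter run, so iterating a set of strings produces a run-dependent order unless the seed is pinned.

# set_order.py -- run this twice with different PYTHONHASHSEED values
# (e.g. PYTHONHASHSEED=1 vs PYTHONHASHSEED=2) and the same elements will
# typically come out in different orders.
symbols = {"start", "$END", "WS", "RULE"}
print(list(symbols))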

erezsh commented 1 year ago

Alternatively, you can run Lark with a constant PYTHONHASHSEED - https://docs.python.org/3.4/using/cmdline.html#envvar-PYTHONHASHSEED
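
Applied to the repro steps above, that would look something like this (any fixed value works; 0 disables hash randomization entirely):

PYTHONHASHSEED=0 python3 -m lark.tools.standalone test.g > test.py
PYTHONHASHSEED=0 python3 -m lark.tools.standalone test.g > test1.py
diff test.py test1.py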

bcr commented 1 year ago

In my case, I have a script that generates the module, and I check the module into git, so I tend to run the generation script periodically to make sure I didn't change the grammar. Of course, I could be an adult and only generate the module when the grammar changes. Thanks for the PYTHONHASHSEED suggestion though; I think I'll just get over it and do a better job on my end.

andyleap commented 11 months ago

I'm trying to help someone integrate Lark-js into a repo, and one of our requirements is that generated files have a CI check verifying the generated file is up to date. For lark-js, this means "generate the file and check whether it differs from the checked-in version". With Lark being non-deterministic, this check is impossible.
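
As a rough sketch, using the standalone generator from the repro steps above (the same idea applies to the lark-js output; file names here are just placeholders):

python3 -m lark.tools.standalone test.g > generated_parser.py
git diff --exit-code generated_parser.py    # fails CI if regeneration changed the file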

bcr commented 11 months ago

When you tried the PYTHONHASHSEED suggestion, what happened?

andyleap commented 11 months ago

That eliminates some of the randomness; it seems like the remainder is due to the memoization stuff.

andyleap commented 11 months ago

Just as a PoC, replacing the lark-js generation stuff with this:

# Imports assumed from the surrounding lark-js module (not shown in the
# original snippet): json, the classes passed to memo_serialize, and
# __dir__ pointing at the package directory.
import json
from pathlib import Path

from lark.grammar import Rule
from lark.lexer import TerminalDef

__dir__ = Path(__file__).parent

def generate_js_standalone(lark_inst):
    """Returns a string containing the Javascript standalone parser,
    for the given Lark instance."""
    if lark_inst.options.parser != 'lalr':
        raise NotImplementedError("Lark.js only works with LALR parsers for now")

    data, memo = lark_inst.memo_serialize([TerminalDef, Rule])

    # Order the memo indices by the JSON form of their entries, so the
    # numbering below no longer depends on the original (random) order.
    remapped_memo = list(range(len(memo)))
    remapped_memo.sort(key=lambda i: json.dumps(memo[i]))

    def walk(data, f):
        # Recursively apply f to every node of the nested dict/list structure.
        data = f(data)
        if isinstance(data, list):
            return [walk(i, f) for i in data]
        elif isinstance(data, dict):
            return {k: walk(v, f) for k, v in data.items()}
        else:
            return data

    def remap(v):
        # Rewrite memo references ({'@': old_index}) to use the sorted index.
        if isinstance(v, dict):
            if '@' in v:
                v['@'] = remapped_memo.index(v['@'])
        return v

    data = walk(data, remap)
    # Rebuild the memo itself in the same deterministic order.
    memo = {i: memo[remapped_memo[i]] for i in range(len(memo))}

    data_json = json.dumps(data, indent=2)
    memo_json = json.dumps(memo, indent=2)

    with open(__dir__ / 'lark.js') as lark_js:
        output = lark_js.read()
        output += '\nvar DATA=%s;\n' % data_json
        output += '\nvar MEMO=%s;\n' % memo_json
    return output

seems to remove the rest of the randomness, but... that's probably the wrong place to mess with the memoization stuff?

MegaIng commented 11 months ago

Where exactly does the randomness come from with a fixed PYTHONHASHSEED? Memoization should only be random because of dict/set ordering.

erezsh commented 11 months ago

There might still be randomness with a fixed PYTHONHASHSEED if we rely on id() for hashing, which might happen inadvertently since it's the default behavior for Python objects.
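
That failure mode is easy to demonstrate in isolation (illustration only, not Lark's code): objects without a custom __hash__ hash by identity, so their set order can change between runs even with PYTHONHASHSEED pinned.

class Symbol:
    # No __hash__/__eq__ defined, so instances hash by identity (memory address).
    def __init__(self, name):
        self.name = name

items = {Symbol("c"), Symbol("a"), Symbol("b")}
# Iteration order depends on object addresses, which PYTHONHASHSEED does not
# control, so it can differ from run to run.
print([s.name for s in items])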

RobertCraigie commented 7 months ago

Just ran into this as well, any updates? I'd be happy to open a PR if you have any pointers :)

erezsh commented 5 months ago

I don't know if this is still an issue or not, but we introduced an OrderedSet implementation in Lark, so a solution would be to simply use it instead of sets everywhere.
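
The idea behind such an OrderedSet is just a set backed by a dict, since dicts preserve insertion order; a minimal sketch (not necessarily the implementation Lark ships):

class OrderedSet:
    """A set with deterministic (insertion-order) iteration, backed by a dict."""

    def __init__(self, items=()):
        self._d = dict.fromkeys(items)

    def add(self, item):
        self._d[item] = None

    def discard(self, item):
        self._d.pop(item, None)

    def __contains__(self, item):
        return item in self._d

    def __iter__(self):
        return iter(self._d)

    def __len__(self):
        return len(self._d)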