lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License

Implement cache for earley parser #1348

Open · jtwaleson opened this issue 9 months ago

jtwaleson commented 9 months ago

**Suggestion**

Lark currently supports caching to disk for the LALR parser, but it would be great to support it for the Earley parser too. We have a project that has around 3k different grammars using the Earley parser, and keeping all those parsers in memory is a bit of a strain on our server.
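
For reference, the existing disk cache is enabled through the `cache` option, which currently only works with LALR, roughly like this (`grammar_text` stands in for one of the grammars):

```python
from lark import Lark

# cache=True works only with parser="lalr" today; passing a string
# instead of True uses it as the cache file path.
parser = Lark(grammar_text, parser="lalr", cache=True)
```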

**Describe alternatives you've considered**

Right now we cache the parsers in memory, but it would be great if we could also cache to disk to save resources. I've tried to "just" disable the "lalr safeguard" in the cache method, but got a pickle error.

**Additional context**

I would like to contribute a PR but the task seems rather daunting as I have no experience with the project. If someone can tell me that it is actually pretty feasible and give me some pointers, I'd be willing to give it a go.

erezsh commented 9 months ago

I have a few questions.

Why do you need to hold 3k different parsers in memory at the same time?

Are the parsers all kept in their initial state, or are they all parsing at the same time?

Are you having performance issues in loading the grammars from scratch?

> I've tried to "just" disable the "lalr safeguard"

The safeguard is there because caching isn't implemented for Earley.

jtwaleson commented 9 months ago

> I have a few questions.

I can imagine ;)

> Why do you need to hold 3k different parsers in memory at the same time?

The project supports 18 different parsers, and about 50 different i18n languages, each with 2 boolean configuration settings. The permutations multiply quickly.

> Are the parsers all kept in their initial state, or are they all parsing at the same time?

They are parsing, but only one at a time. After that I think the parser is reset (though to be honest, I'm pretty new to the project, so I don't know for sure yet).

> Are you having performance issues in loading the grammars from scratch?

Yes, it adds about 1 second of delay to all requests. Also in our CI implementation we have about 30k test runs, and re-using the parsers with a memory cache gives about a 4-5x performance boost, but at a big memory cost.
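
A minimal sketch of that kind of in-memory cache, assuming the `create_grammar` helper and the cache-key parameters that appear later in this thread:

```python
from functools import lru_cache

from lark import Lark


@lru_cache(maxsize=None)  # every permutation stays alive for the process lifetime
def get_parser(level, lang, keep_all_tokens, skip_faulty):
    grammar = create_grammar(level, lang, skip_faulty)  # project-specific helper
    return Lark(grammar, regex=True, propagate_positions=True,
                keep_all_tokens=keep_all_tokens)
```

With `lru_cache(maxsize=None)` nothing is ever evicted, which is exactly the memory-for-speed trade-off described above.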

> I've tried to "just" disable the "lalr safeguard"

> The safeguard is there because caching isn't implemented for Earley.

I know, I just put it there because the template asked what alternatives I've considered, and to show that I made an effort ;)

MegaIng commented 9 months ago

FWIW, I would expect the stdlib pickle module to just work for Earley-based Lark objects.

> Yes, it adds about 1 second of delay to all requests

Are these grammars really large? Constructing an Earley parser usually shouldn't take a second. Or are you reconstructing all parsers at once?

Also, if you have performance considerations, I would strongly recommend trying to switch to lalr for all grammars. That will be a very significant performance and memory usage boost.

jtwaleson commented 9 months ago

> FWIW, I would expect the stdlib pickle module to just work for Earley-based Lark objects.

Do you mean like this, or that it should be possible with some small changes?

        parser = Lark(grammar, regex=True, propagate_positions=True, keep_all_tokens=keep_all_tokens)
        with open("attempt-to-cache.pkl", "wb") as fp:
>           pickle.dump(parser, fp)
E           TypeError: cannot pickle 'module' object

> Yes, it adds about 1 second of delay to all requests

> Are these grammars really large? Constructing an Earley parser usually shouldn't take a second. Or are you reconstructing all parsers at once?

No, not really large I think: about 170 lines of grammar. They are constructed on-demand. I just tracked the time more specifically, and locally it's more like 0.2 to 0.3 seconds; in CI a bit more.

> Also, if you have performance considerations, I would strongly recommend trying to switch to lalr for all grammars. That will be a very significant performance and memory usage boost.

Alright, we'll consider that, thanks for the advice!

erezsh commented 9 months ago

Serializing / pickling Earley shouldn't be too hard. Essentially, Earley accepts only a few params, and does very little processing on them. So, if you serialize these params, you can re-create the Earley object rather easily.

The params are:

    def __init__(self, lexer_conf: 'LexerConf', parser_conf: 'ParserConf', term_matcher, resolve_ambiguity=True, debug=False, tree_class=Tree):

Both lexer_conf and parser_conf support the .serialize() and .deserialize() methods. The other parameters can be pickled as necessary.

For a generic solution that would make it into Lark, you'll have to add some logic to parser_frontends.py, but if you just want to hack a solution for yourself, you might even be able to avoid that.
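
An untested sketch of that recipe, pickling the two conf objects directly rather than going through .serialize(); the import path is an internal lark module, and term_matcher will only pickle if it is a plain module-level function, so treat both as assumptions:

```python
import pickle

import regex

from lark.parsers.earley import Parser as EarleyParser  # internal module path, may change


def dump_earley_params(lexer_conf, parser_conf, term_matcher, path):
    # Persist the constructor inputs rather than the built parser.
    # Module objects can't be pickled, so strip re_module first.
    lexer_conf.re_module = None
    with open(path, "wb") as fp:
        pickle.dump((lexer_conf, parser_conf, term_matcher), fp)
    lexer_conf.re_module = regex


def load_earley_parser(path):
    with open(path, "rb") as fp:
        lexer_conf, parser_conf, term_matcher = pickle.load(fp)
    lexer_conf.re_module = regex  # restore the stripped module
    # Re-create the parser from its params, per the signature above
    return EarleyParser(lexer_conf, parser_conf, term_matcher)
```

Note that re-creating the parser this way re-runs its __init__, which, as it turns out further down in this thread, is where most of the construction time goes.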

> about 50 different i18n languages, each with 2 boolean configuration settings

Perhaps an even easier solution would be to merge these implementations. You only have 18 actually different grammars, so maybe you could implement the i18n by overriding Lark.terminals before parsing, and changing the booleans as needed.

jtwaleson commented 9 months ago

Thank you so much for the help and quick responses @erezsh ! I'll give it a try. If it's easy to add to Lark, would you be open to a PR or is this not something you'd prioritize? Feel free to close this issue or keep it open for future reference, whatever you prefer.

erezsh commented 9 months ago

Yes, we'll be happy to accept a PR that extends the existing interface to support Earley.

I can't promise how soon I'll get to it, since I have other obligations, but I'll try to be responsive.

We usually close issues after a solution PR has been merged. (when you open a PR, you can mention this issue to link them)

jtwaleson commented 9 months ago

Ok I believe I got it working, but not in a shape that's fit for a PR.

The only thing that was not "pickle-able" was the re_module in the lexer conf! So I just set that to None while pickling, and restore it to regex afterwards, like below.

Note that this is not the production code yet, just a PoC; we still want to hash the grammar into the filename, etc. It speeds up our test suite INCREDIBLY, as previously parser generation had to re-run at least once in every process to populate the in-memory cache. Now it's almost instant!

import os
import pickle

import regex

from lark import Lark


def load_or_build_parser(level, lang, keep_all_tokens, skip_faulty):
    # TODO implement grammar hash into this
    pickle_file = f"cached-parser-{level}-{lang}-{keep_all_tokens}-{skip_faulty}.pkl"

    if os.path.isfile(pickle_file):
        # TODO on failure delete the cache file and continue fresh
        with open(pickle_file, "rb") as fp:
            frontend = pickle.load(fp)
        # restore the module that was stripped out before pickling
        frontend.parser.lexer_conf.re_module = regex
        frontend.lexer_conf.re_module = regex
        return frontend

    grammar = create_grammar(level, lang, skip_faulty)
    lark = Lark(grammar, regex=True, propagate_positions=True, keep_all_tokens=keep_all_tokens)

    frontend = lark.parser

    # module objects can't be pickled, so blank out re_module for the dump
    frontend.parser.lexer_conf.re_module = None
    frontend.lexer_conf.re_module = None

    with open(pickle_file + ".tmp", "wb") as fp:
        pickle.dump(frontend, fp)
    try:
        # atomically replace the file
        os.rename(pickle_file + ".tmp", pickle_file)
    except Exception:
        # concurrent renames might result in a file disappearing
        pass

    frontend.parser.lexer_conf.re_module = regex
    frontend.lexer_conf.re_module = regex

    return frontend
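
For the grammar-hash TODO, something along these lines should do (a sketch, stdlib only; note it means building the grammar text before the cache lookup):

```python
import hashlib

# Key the cache file on the grammar text itself, so a changed grammar
# automatically misses the stale pickle. keep_all_tokens changes the
# parser but not the grammar text, so it stays part of the key.
grammar = create_grammar(level, lang, skip_faulty)
grammar_hash = hashlib.sha256(grammar.encode("utf-8")).hexdigest()[:16]
pickle_file = f"cached-parser-{grammar_hash}-{keep_all_tokens}.pkl"
```
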
jtwaleson commented 9 months ago

> Serializing / pickling Earley shouldn't be too hard. Essentially, Earley accepts only a few params, and does very little processing on them. So, if you serialize these params, you can re-create the Earley object rather easily.

FYI I didn't find this to be the case. Here is a flamegraph I made of the parser generation.

[flamegraph: profile of Earley parser generation]

Most of the time was spent in calculate_sets, which runs inside the Earley __init__ function.
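
For anyone who wants to reproduce this measurement without a flamegraph, stdlib cProfile shows the same hotspot (the grammar file name here is hypothetical):

```python
import cProfile

from lark import Lark  # used inside the profiled statement

grammar_text = open("my_grammar.lark").read()  # hypothetical grammar file

# Sort by cumulative time; calculate_sets should show up near the top
cProfile.run("Lark(grammar_text, parser='earley')", sort="cumtime")
```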

erezsh commented 9 months ago

I see, that makes sense. I forgot it happens internally, since that function is called for LALR as well. Perhaps it should be refactored out.

Another thing I recently noticed is that some of the calculations in calculate_sets aren't used by Earley, so possibly the function can be split to spare some of the calculation time.