[Open] jtwaleson opened this issue 9 months ago
I have a few questions. Why do you need to hold 3k different parsers in memory at the same time?
Are the parsers all kept in their initial state, or are they all parsing at the same time?
Are you having performance issues in loading the grammars from scratch?
> I've tried to "just" disable the "lalr safeguard"
The safeguard is there because caching isn't implemented for Earley.
> I have a few questions.
I can imagine ;)
> Why do you need to hold 3k different parsers in memory at the same time?
The project supports 18 different parsers, and about 50 different i18n languages, each with 2 boolean configuration settings. The permutations multiply quickly.
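As a sanity check, the arithmetic lines up with the "around 3k" figure from the issue, assuming the two boolean settings vary independently (that independence is my assumption, not stated above):

```python
# 18 parsers x ~50 i18n languages x 2 independent boolean settings
# (2 x 2 = 4 combinations per grammar/language pair).
parsers = 18
languages = 50
boolean_combinations = 2 ** 2

total = parsers * languages * boolean_combinations
print(total)  # 3600, i.e. roughly the 3k parsers mentioned in the issue
```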
> Are the parsers all kept in their initial state, or are they all parsing at the same time?
They are parsing but only one at a time. After that I think the parser is reset (but to be honest I'm pretty new to the project so I don't know for sure yet).
> Are you having performance issues in loading the grammars from scratch?
Yes, it adds about 1 second of delay to all requests. Also in our CI implementation we have about 30k test runs, and re-using the parsers with a memory cache gives about a 4-5x performance boost, but at a big memory cost.
> I've tried to "just" disable the "lalr safeguard"
>
> The safeguard is there because caching isn't implemented for Earley.
I know, I just put it there because the template asked what alternatives I've considered, and to show that I made an effort ;)
FWIW, I would expect the stdlib `pickle` module to just work for earley based Lark objects.
> Yes, it adds about 1 second of delay to all requests
Are these grammars really large? Constructing an earley parser shouldn't take a second most of the time. Or are you reconstructing all parsers at once?
Also, if you have performance considerations, I would strongly recommend trying to switch to lalr for all grammars. That will be a very significant performance and memory usage boost.
> FWIW, I would expect the stdlib `pickle` module to just work for earley based Lark objects.
Do you mean like this, or that it should be possible with some small changes?
```
parser = Lark(grammar, regex=True, propagate_positions=True, keep_all_tokens=keep_all_tokens)
with open("attempt-to-cache.pkl", "wb") as fp:
>       pickle.dump(parser, fp)
E       TypeError: cannot pickle 'module' object
```
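That `TypeError` is pickle's standard complaint about module objects: any attribute chain that reaches a stored module makes the whole object unpicklable. A minimal stdlib reproduction (the `Holder` class is just an illustration, not Lark code):

```python
import pickle
import re

class Holder:
    def __init__(self):
        # Keeping a module object as an attribute poisons pickling
        # of the entire instance.
        self.re_module = re

try:
    pickle.dumps(Holder())
except TypeError as exc:
    print(exc)  # cannot pickle 'module' object
```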
> Yes, it adds about 1 second of delay to all requests
>
> Are these grammars really large? Constructing an earley parser shouldn't take a second most of the time. Or are you reconstructing all parsers at once?
No not really large I think, about 170 lines of grammar. They are constructed on-demand. I just tracked the time more specifically and locally it's more like 0.2 to 0.3 seconds, in CI a bit more.
> Also, if you have performance considerations, I would strongly recommend trying to switch to lalr for all grammars. That will be a very significant performance and memory usage boost.
Alright, we'll consider that, thanks for the advice!
Serializing / pickling Earley shouldn't be too hard. Essentially, Earley accepts only a few params, and does very little processing on them. So, if you serialize these params, you can re-create the Earley object rather easily.
The params are:

```python
def __init__(self, lexer_conf: 'LexerConf', parser_conf: 'ParserConf', term_matcher, resolve_ambiguity=True, debug=False, tree_class=Tree):
```
Both `lexer_conf` and `parser_conf` support the `.serialize()` and `.deserialize()` methods. The other parameters can be pickled as necessary.
For a generic solution that would make it into Lark, you'll have to add some logic to `parser_frontends.py`, but if you just want to hack a solution for yourself, you might even be able to avoid that.
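The pattern described here ("serialize the constructor params, re-create the object") can be sketched generically with plain pickle. Note that `EarleyLike` and its methods are hypothetical stand-ins for illustration, not Lark's actual classes or API:

```python
import pickle

class EarleyLike:
    """Hypothetical stand-in for a parser that does little processing
    on its constructor arguments, so persisting them is enough."""

    def __init__(self, lexer_conf, parser_conf, resolve_ambiguity=True):
        self.lexer_conf = lexer_conf
        self.parser_conf = parser_conf
        self.resolve_ambiguity = resolve_ambiguity

    def serialize(self):
        # Persist only the constructor arguments, not derived state.
        return pickle.dumps(
            (self.lexer_conf, self.parser_conf, self.resolve_ambiguity)
        )

    @classmethod
    def deserialize(cls, data):
        # Re-run __init__ with the saved arguments to rebuild the object.
        lexer_conf, parser_conf, resolve_ambiguity = pickle.loads(data)
        return cls(lexer_conf, parser_conf, resolve_ambiguity)

blob = EarleyLike({"terminals": []}, {"rules": []}, False).serialize()
restored = EarleyLike.deserialize(blob)
print(restored.resolve_ambiguity)  # False
```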
> about 50 different i18n languages, each with 2 boolean configuration settings
Perhaps an even easier solution would be to merge these implementations. You only have 18 actually different grammars, so maybe you could implement the i18n by overriding `Lark.terminals` before parsing, and changing the booleans as needed.
Thank you so much for the help and quick responses @erezsh ! I'll give it a try. If it's easy to add to Lark, would you be open to a PR or is this not something you'd prioritize? Feel free to close this issue or keep it open for future reference, whatever you prefer.
Yes, we'll be happy to accept a PR that extends the existing interface to support Earley.
I can't promise how soon I'll get to it, since I have other obligations, but I'll try to be responsive.
We usually close issues after a solution PR has been merged. (when you open a PR, you can mention this issue to link them)
Ok I believe I got it working, but not in a shape that's fit for a PR.
The only thing that was not "pickle-able" was the `re_module` in the lexer conf! So I just set that to `None` while pickling and reset it to `regex` afterwards, like below.
Note that this is not the production code yet, just a PoC, we want to hash the grammar in the filename etc. It speeds up our test suite INCREDIBLY, as previously we had to re-run parser generation at least once for all processes to store them in memory. Now it's almost instant!
```python
import os
import pickle

import regex

# TODO implement grammar hash into this
pickle_file = f"cached-parser-{level}-{lang}-{keep_all_tokens}-{skip_faulty}.pkl"
if os.path.isfile(pickle_file):
    # TODO on failure delete the cache file and continue fresh
    with open(pickle_file, "rb") as fp:
        frontend = pickle.load(fp)
    frontend.parser.lexer_conf.re_module = regex
    frontend.lexer_conf.re_module = regex
    return frontend

grammar = create_grammar(level, lang, skip_faulty)
lark = Lark(grammar, regex=True, propagate_positions=True, keep_all_tokens=keep_all_tokens)
frontend = lark.parser
frontend.parser.lexer_conf.re_module = None
frontend.lexer_conf.re_module = None
with open(pickle_file + ".tmp", "wb") as fp:
    pickle.dump(frontend, fp)
try:
    # atomically replace the file
    os.rename(pickle_file + ".tmp", pickle_file)
except Exception:
    # concurrent renames might result in a file disappearing
    pass
frontend.parser.lexer_conf.re_module = regex
frontend.lexer_conf.re_module = regex
return frontend
```
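As a side note, the manual None-then-restore dance around `re_module` can also be expressed with pickle's `__getstate__`/`__setstate__` hooks, which keep the fix-up logic in one place. A stdlib-only sketch (`Frontend` here is a hypothetical stand-in, using `re` in place of the third-party `regex` module):

```python
import pickle
import re  # standing in for the third-party `regex` module

class Frontend:
    def __init__(self):
        self.re_module = re       # unpicklable module reference
        self.grammar = "start: WORD"

    def __getstate__(self):
        # Drop the module reference from the pickled state.
        state = self.__dict__.copy()
        state["re_module"] = None
        return state

    def __setstate__(self, state):
        self.__dict__.update(state)
        # Restore the module reference on load.
        self.re_module = re

restored = pickle.loads(pickle.dumps(Frontend()))
print(restored.re_module is re)  # True
```

One more note on the snippet above: `os.replace` is the atomic-rename variant that also overwrites an existing target on Windows, where `os.rename` raises an error if the destination already exists.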
> Serializing / pickling Earley shouldn't be too hard. Essentially, Earley accepts only a few params, and does very little processing on them. So, if you serialize these params, you can re-create the Earley object rather easily.
FYI I didn't find this to be the case. Here is a flamegraph I made of the parser generation.
Most of the time was spent on `calculate_sets`, which is done within the Earley init function.
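For reproducing this kind of measurement without flamegraph tooling, the stdlib `cProfile`/`pstats` pair gives a quick cumulative-time breakdown. The `build_parser` function below is a hypothetical stand-in for the actual parser-construction call:

```python
import cProfile
import io
import pstats

def build_parser():
    # Stand-in workload; in the real project this would be the
    # Lark(...) construction whose calculate_sets step dominates.
    return sum(i * i for i in range(100_000))

profiler = cProfile.Profile()
profiler.enable()
build_parser()
profiler.disable()

# Print the top entries sorted by cumulative time.
buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(10)
report = buf.getvalue()
print("build_parser" in report)  # True
```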
I see, that makes sense. I forgot it happens internally, since that function is called for LALR anyway too. Perhaps it should be refactored out.
Another thing I recently noticed is that some of the calculations in `calculate_sets` aren't used by Earley, so possibly the function can be split to spare some of the calculation time.
**Suggestion**
Lark currently supports caching to disk for the LALR parser, but it would be great to support it for the Earley parser too. We have a project that has around 3k different grammars using the Earley parser, and keeping all those parsers in memory is a bit of a strain on our server.

**Describe alternatives you've considered**
Right now we cache the parsers in memory, but it would be great if we can also cache to disk to save resources. I've tried to "just" disable the "lalr safeguard" in the cache method, but got a pickle error.

**Additional context**
I would like to contribute a PR but the task seems rather daunting as I have no experience with the project. If someone can tell me that it is actually pretty feasible and give me some pointers, I'd be willing to give it a go.