Proposed interface:

- `cache=None` will disable the cache
- `cache='auto'` (default) will use a temporary file selected by Lark, based on a hash
- `cache=[Path object]` will use that path for saving the file, still considering the hash to prevent versioning issues

It would be important to include the Lark version inside the hash, to make sure that Lark doesn't try to load an incompatible parser.
Another set of values is also possible, and perhaps preferable (a usage sketch follows below):

- `cache=False` will disable the cache
- `cache=True` (default) will use a temporary file selected by Lark, based on a hash
- `cache="filename"` will use that path for saving the file, still considering the hash to prevent versioning issues (also accepts Path objects)

Do you think the first is better, or this one?
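For illustration, a minimal sketch of how the second set of values might look from the user's side. This is only a sketch of the proposal: the grammar is made up, and no such `cache` parameter existed at the time of writing.

```python
from lark import Lark

grammar = r"""
start: "hello" NAME
NAME: /\w+/
%ignore " "
"""

# Proposed semantics, not a finalized API:
Lark(grammar, parser="lalr", cache=False)           # no caching
Lark(grammar, parser="lalr", cache=True)            # temp file chosen by Lark, keyed on a hash
Lark(grammar, parser="lalr", cache="parser.cache")  # explicit file; hash still checked
```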
> Another set of values is also possible, and perhaps preferable:
>
> - `cache=False` will disable the cache
> - `cache=True` (default) will use a temporary file selected by Lark, based on a hash
> - `cache="filename"` will use that path for saving the file, still considering the hash to prevent versioning issues (also accepts Path objects)
>
> Do you think the first is better, or this one?
I think this one is better; it would make the code slightly cleaner for the user.
I would mix them and use `False | True | [Path object]`. I am thinking of type checking, and I feel like `Union[bool, Path]` would be more powerful than `Optional[str]`.
In any case, I like the idea; I was going through the list of issues to check for exactly that :)
Any idea what sort of performance improvement to expect?
If anyone wants to cache the parser until the feature is in: here is how we cache the Lark instance at runtime, using `functools.cached_property` on Python 3.8, or the following decorator on Python 3.7:
```python
class cached_property:
    """A property that is only computed once per instance and then replaces
    itself with an ordinary attribute. Deleting the attribute resets the
    property.
    """

    def __init__(self, func):
        self.__doc__ = func.__doc__
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            return self
        # Compute once, then store the result on the instance; the instance
        # attribute shadows this non-data descriptor on subsequent lookups.
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value
```
This replaces a method with its return value on the first call, caching the property. You can see it in action here: https://github.com/Scony/godot-gdscript-toolkit/blob/master/gdtoolkit/parser/parser.py
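For instance, a hypothetical sketch of the pattern (the class and grammar file names are made up, not taken from the linked project; `cached_property` is the decorator defined above, or `functools.cached_property` on 3.8+):

```python
from lark import Lark

class GDScriptParser:
    """Hypothetical wrapper that builds its Lark parser lazily."""

    @cached_property
    def parser(self):
        # Built on first access only; later accesses hit the cached attribute.
        return Lark.open("gdscript.lark", parser="lalr")

    def parse(self, code: str):
        return self.parser.parse(code)
```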
@NathanLovato This issue refers to caching the grammar onto permanent storage (i.e. the filesystem), so that starting it in a new process won't require any grammar construction.
@erezsh I got it, but in the meantime I thought I'd share this: it can be useful for people looking to cache at runtime for now, while still decoupling their code.
I tried caching to disk with pickle first, then serializing through Lark and dumping the data as JSON, but the `Lark.deserialize` method gives an error on load. I took the `standalone.py` file as an example; I'm not sure if the approach is correct. Below, the `parser` argument is a Lark object built as:

```python
Lark.open(
    grammar,
    parser="lalr",
    start="start",
    propagate_positions=add_metadata,
    maybe_placeholders=False,
)
```
```python
# Imports assumed: import json, os
# from lark import Lark
# from lark.grammar import Rule
# from lark.lexer import TerminalDef
# Indenter here is this project's subclass of lark.indenter.Indenter.

def save(self, parser: Lark, path: str) -> None:
    """Serializes the Lark parser and saves it to the disk."""
    data, memo = parser.memo_serialize([TerminalDef, Rule])
    write_data: dict = {
        "data": data,
        "memo": memo,
    }
    dirpath: str = os.path.dirname(path)
    if not os.path.exists(dirpath):
        os.makedirs(dirpath)
    with open(path, "w", encoding="utf8") as file_parser:
        json.dump(write_data, file_parser)

def load(self, path: str) -> Lark:
    """Loads the Lark parser from the disk and deserializes it."""
    with open(path, "r", encoding="utf8") as file_parser:
        data: dict = json.load(file_parser)
    namespace = {"Rule": Rule, "TerminalDef": TerminalDef}
    # TODO: write a class to serialize Indenter as json as well
    return Lark.deserialize(
        data["data"],
        namespace,
        data["memo"],
        transformer=None,
        postlex=Indenter(),
    )
```
The error:

```
...
  File "/home/gdquest/.local/lib/python3.7/site-packages/gdtoolkit/parser/parser.py", line 160, in load
    postlex=Indenter(),
  File "/home/gdquest/.local/lib/python3.7/site-packages/lark/lark.py", line 244, in <listcomp>
    inst.rules = [Rule.deserialize(r, memo) for r in data['rules']]
  File "/home/gdquest/.local/lib/python3.7/site-packages/lark/utils.py", line 103, in deserialize
    return memo[data['@']]
KeyError: 230
```
Am I doing something wrong there? Or should I open an issue with a minimal example to reproduce this?
In the project I'm contributing to, computing one Lark parser takes ~0.5 s on my computer, which is pretty fast. As we're using it for a code formatter and linter, it'd be ideal if we could spawn the process almost instantly, to run it on individual buffers in a code editor. Caching would also be simpler than turning the program into a daemon.
@NathanLovato Your code won't work, because `json` distorts dictionaries (e.g. it turns int keys into strings, causing KeyErrors). It's better to use `pickle`.
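To illustrate the distortion, here is a minimal standalone sketch (independent of Lark's actual memo format):

```python
import json

memo = {230: "a rule"}                    # the serialized memo uses int keys
restored = json.loads(json.dumps(memo))   # JSON object keys must be strings

print(restored)   # {'230': 'a rule'} -- the int key is now a string
restored[230]     # KeyError: 230, the same failure mode as in the traceback above
```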
I just pushed a commit to `master` that provides `load` and `save` methods which accept a file object. Example usage:

```python
with open('a.tmp', 'wb') as f:
    calc_parser.save(f)

with open('a.tmp', 'rb') as f:
    parser = Lark.load(f)
```
Ah, I see! And I didn't try to pickle just the serialized data instead of the full object. Thank you for your time and help. 🙂
I'm happy to help. Let me know if these functions work for you. I plan to keep them for the next minor version, and build the cache feature on top of them.
Thanks so much! Yes, it's working well. For our project I've had to replicate your code for now, as we use a pinned version of lark at the moment, but when the release is out we'll upgrade :)
Lark has supported grammar caching since v0.8.6.
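For completeness, a minimal usage sketch of the released feature (the grammar here is made up; check the Lark docs for the exact semantics in your version, as caching applies to LALR only):

```python
from lark import Lark

grammar = r"""
start: WORD+
WORD: /\w+/
%ignore " "
"""

# cache=True stores the grammar analysis in a temporary file keyed on a hash,
# so later runs skip it; a string value pins the cache to that path instead.
parser = Lark(grammar, parser="lalr", cache=True)
print(parser.parse("hello world"))
```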
This is straightforward, and such a feature already exists in PLY, for example. It would be especially beneficial for LALR. It should be fairly easy to write, since the `Lark` class already supports serialization and deserialization. All that's required is to add an interface for saving/loading it to some temporary file.