lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.92k stars · 417 forks

New feature: Cache grammars for fast initialization of Lark #479

Closed erezsh closed 4 years ago

erezsh commented 5 years ago

This is straightforward, and such a feature already exists in PLY, for example. It would be especially beneficial for LALR.

It should be fairly easy to write, since the Lark class already supports serialization and deserialization. All that's required is to add an interface for saving/loading it to some temporary file.

erezsh commented 5 years ago

Proposed interface:

MegaIng commented 5 years ago

It would be important to include the lark version inside the hash to make sure that lark doesn't try to load an incompatible parser.
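For example, a version-aware cache key could be computed like this (a sketch, not Lark's actual implementation; the version string is hard-coded here, but in practice it would come from `lark.__version__`):

```python
import hashlib

LARK_VERSION = "0.8.6"  # stand-in for lark.__version__

def cache_key(grammar_text: str, options: dict) -> str:
    """Build a cache key that changes whenever the grammar text, the
    parser options, or the installed lark version change, so a stale
    cache from an incompatible version is never loaded."""
    payload = repr((LARK_VERSION, grammar_text, sorted(options.items())))
    return hashlib.sha256(payload.encode("utf8")).hexdigest()
```

Because the version participates in the hash, upgrading lark simply produces a new cache file instead of loading an incompatible one.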

erezsh commented 5 years ago

Another set of values is also possible, and perhaps preferable:

  • cache=False will disable cache
  • cache=True (default) will use a temporary file selected by Lark, based on hash
  • cache="filename" will use that path for saving the file, still considering hash to prevent versioning issues (Also accepts Path objects).

Do you think the first is better, or this one?

giuliano-macedo commented 5 years ago

Another set of values is also possible, and perhaps preferable:

  • cache=False will disable cache
  • cache=True (default) will use a temporary file selected by Lark, based on hash
  • cache="filename" will use that path for saving the file, still considering hash to prevent versioning issues (Also accepts Path objects).

Do you think the first is better, or this one?

I think this one is better; it would make for slightly cleaner code on the user's side.
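A sketch of how a loader might dispatch on such a `cache` argument (the function name is hypothetical and the hash check is elided; this only illustrates the three-way behavior described above):

```python
import tempfile
from pathlib import Path
from typing import Optional, Union

def resolve_cache_path(cache: Union[bool, str, Path], key: str) -> Optional[Path]:
    """Map the proposed cache argument onto a concrete file path.

    cache=False  -> no caching (returns None)
    cache=True   -> a temporary file chosen by Lark, derived from the hash key
    cache="path" -> the user-supplied location (str or Path)
    """
    if cache is False:
        return None
    if cache is True:
        return Path(tempfile.gettempdir()) / ("lark_%s.cache" % key)
    return Path(cache)
```

Note the explicit `is True` / `is False` checks: they keep a `Path` or string argument from being swallowed by a truthiness test.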

laurentS commented 4 years ago

I would mix them and use False | True | [Path object]. I am thinking of type checking, and I feel like Union[bool, Path] would be more powerful than Optional[str]. In any case, I like the idea, I was going through the list of issues to check exactly for that :)

Any idea what sort of performance improvement to expect?

NathanLovato commented 4 years ago

If anyone wants to cache the parser until the feature is in, here is how we cache the Lark instance at runtime, using functools.cached_property on Python 3.8 or the following decorator on Python 3.7:

class cached_property:
    """A property that is only computed once per instance and then replaces
    itself with an ordinary attribute. Deleting the attribute resets the
    property.
    """

    def __init__(self, func):
        self.__doc__ = func.__doc__
        self.func = func

    def __get__(self, obj, cls):
        if obj is None:
            return self
        value = obj.__dict__[self.func.__name__] = self.func(obj)
        return value

This replaces a method by its return value on the first call, caching the property. You can see it in action here: https://github.com/Scony/godot-gdscript-toolkit/blob/master/gdtoolkit/parser/parser.py
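Usage is the same with either variant; here is a minimal, self-contained example using the standard-library `functools.cached_property` (the class and its counter are illustrative, not from the linked project):

```python
import functools

class ParserProvider:
    """Builds the expensive object once per instance, then reuses it."""
    build_count = 0  # counts how many times the "expensive" build runs

    @functools.cached_property
    def parser(self):
        ParserProvider.build_count += 1
        return object()  # stand-in for an expensive Lark(...) construction

provider = ParserProvider()
first = provider.parser
second = provider.parser  # no rebuild: the cached value shadows the property
```

On the second access the attribute stored in the instance's `__dict__` is returned directly, so the build body never runs again for that instance.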

erezsh commented 4 years ago

@NathanLovato This issue refers to caching the grammar on permanent storage (i.e. the filesystem), so that starting Lark in a new process won't require any grammar construction.

NathanLovato commented 4 years ago

@erezsh Got it, but in the meantime I thought I'd share this; it can be useful for people looking to cache at runtime right now, while still decoupling their code.

I tried caching to disk with pickle first, then serializing through Lark and dumping the data as JSON, but:

Below, the parser argument is a Lark object built as:

Lark.open(
    grammar,
    parser="lalr",
    start="start",
    propagate_positions=add_metadata,
    maybe_placeholders=False,
)
    def save(self, parser: Lark, path: str) -> None:
        """Serializes the Lark parser and saves it to the disk."""
        data, memo = parser.memo_serialize([TerminalDef, Rule])
        write_data: dict = {
            "data": data,
            "memo": memo,
        }

        dirpath: str = os.path.dirname(path)
        os.makedirs(dirpath, exist_ok=True)
        with open(path, "w", encoding="utf8") as file_parser:
            json.dump(write_data, file_parser)

    def load(self, path: str) -> Lark:
        """Loads the Lark parser from the disk and deserializes it."""
        with open(path, "r", encoding="utf8") as file_parser:
            data: dict = json.load(file_parser)
        namespace = {"Rule": Rule, "TerminalDef": TerminalDef}
        # TODO: write a class to serialize Indenter as json as well
        return Lark.deserialize(
            data["data"],
            namespace,
            data["memo"],
            transformer=None,
            postlex=Indenter(),
        )

The error:

...
  File "/home/gdquest/.local/lib/python3.7/site-packages/gdtoolkit/parser/parser.py", line 160, in load
    postlex=Indenter(),
  File "/home/gdquest/.local/lib/python3.7/site-packages/lark/lark.py", line 244, in <listcomp>
    inst.rules = [Rule.deserialize(r, memo) for r in data['rules']]
  File "/home/gdquest/.local/lib/python3.7/site-packages/lark/utils.py", line 103, in deserialize
    return memo[data['@']]
KeyError: 230

Am I doing something wrong there? Or should I open an issue with a minimal example to reproduce this?

In the project I'm contributing to, computing one lark parser takes ~.5s on my computer, which is pretty fast. As we're using it for a code formatter and linter, it'd be ideal if we could spawn the process almost instantly, to run it on individual buffers in a code editor. And caching would be simpler than turning the program into a daemon.

erezsh commented 4 years ago

@NathanLovato Your code won't work, because JSON distorts dictionaries (e.g. it turns int keys into strings, causing KeyErrors).

It's better to use pickle.
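The distortion is easy to demonstrate: JSON object keys are always strings, so an int-keyed memo table silently changes shape on a round trip, while pickle preserves it exactly:

```python
import json
import pickle

# Int-keyed dict, like the memo table in Lark's serialization
memo = {230: "rule"}

via_json = json.loads(json.dumps(memo))
via_pickle = pickle.loads(pickle.dumps(memo))

# json converts the key to a string, so a later memo[230] raises KeyError
assert via_json == {"230": "rule"}
assert via_pickle == {230: "rule"}
```

This matches the `KeyError: 230` in the traceback above: after the JSON round trip, the memo is keyed by `"230"`, not `230`.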

I just pushed a commit to master that provides `save` and `load` methods accepting a file object.

Example usage:

    with open('a.tmp', 'wb') as f:
        calc_parser.save(f)
    with open('a.tmp', 'rb') as f:
        parser = Lark.load(f)

NathanLovato commented 4 years ago

Ah, I see! And I didn't try to pickle just the serialized data instead of the full object. Thank you for your time and help. 🙂

erezsh commented 4 years ago

I'm happy to help. Let me know if these functions work for you. I plan to keep them for the next minor version, and build the cache feature on top of them.

NathanLovato commented 4 years ago

Thanks so much! Yes, it's working well. For our project I've had to replicate your code for now, as we use a locked version of lark at the moment, but when the release is out we'll upgrade :)

erezsh commented 4 years ago

Lark supports grammar caching since v0.8.6.