lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.81k stars 409 forks source link

Generate parser as Python C-Extension #737

Open sirex opened 3 years ago

sirex commented 3 years ago

Suggestion Lark really makes creating parsers easy, but unfortunately generated parser is very slow. I recently had to debug why things are so slow and found that most of the time is spent on parsing. Hardcoding simple and frequently used cases like _id="UUID" helped to improve performance 10 times. I'm talking not about 10 times better parsing, but overall app performance.

So maybe adding possibility to generate parser as a C-Extension, which would have same interface, would really improve performance? Having Python based parser is nice for prototyping, but eventually it will have to be ported to a lower level language in order to increase performance.

Describe alternatives you've considered Probably updating grammar from Earley to LALR would increase performance, but still having C parser will be a lot faster.

I was looking at pegen, but I'm not sure if it is intended for anything else other than Python language itself.

Other options probably to port grammar from Lark to bison and flex. But that sounds too complicated.

Additional context I'm working on data management project, where I use a small expression language for data transformation. Parser I use can be found here:

https://gitlab.com/atviriduomenys/spinta/-/blob/master/spinta/spyna.py

The man performance issues where in an upsert action, where I upserting a lot of data and each upsert action has a "_where": "_id='UUID'" expression which is parsed with Lark, and this took most of the time.

MegaIng commented 3 years ago

Well, yes for those simple situations lark isn't that good. If you can just use a single regex, use it.

I recently started work on a implementation in nim, instantly giving a 25% speed increase (For a nontrivial example. Your example probably will not benefit that much).

erezsh commented 3 years ago

Probably updating grammar from Earley to LALR would increase performance

Yes, definitely. Actually, it's possible that LALR in Python would be faster than Earley in C.

Looking at your grammar, it looks like it's simple enough, that it should be possible to use LALR.

ThatXliner commented 3 years ago

I have successfully transpiled lark (via nuitka) into C but the common.lark and other metadata files had some trouble catching on the transpiliation.

aviallon commented 3 years ago

I don't know if it is feasible at all, but it might be possible to fully annotate Lark parser's code to then compile it with Cython to hopefully achieve much better performance (http://docs.cython.org/en/latest/src/tutorial/pure.html#static-typing)

Erotemic commented 2 years ago

@aviallon Static typing the python code would be useful, but not for Cythonization. However, I did some profiling of the LALR parser, and it seems one of the biggest time sinks is creating new instances of the Token class, which is a child class of str. Doing this in Cython might be faster than in CPython.

I've written code to solve the "balanced-embedding" problem, which is effectively a string-based dynamic program. I have two variants in pure Python and in Cython. Using Cython gives about a 20-25% speed increase without any major changes between the cython / python variants.

Here are the two files I mentioned: https://gitlab.kitware.com/computer-vision/torch_liberator/-/blob/main/torch_liberator/_nx_ext_v2/balanced_embedding_cython.pyx

https://gitlab.kitware.com/computer-vision/torch_liberator/-/blob/main/torch_liberator/_nx_ext_v2/balanced_embedding.py

Although note that if you use Cython with Python types, you still incur a lot of the interpreter overhead and the compiler wont be able to do much optimization. It would probably be a good idea to start with the Python types, but then swap them out for raw C types, but that might be tricky.

Erotemic commented 2 years ago

Because I apparently can't google: Posting this here for others that are looking for the Lark Cython package:

https://github.com/lark-parser/lark_cython