lark-parser / lark

Lark is a parsing toolkit for Python, built with a focus on ergonomics, performance and modularity.
MIT License
4.86k stars 413 forks source link

Optimization: Merge anonymous terminals #221

Open erezsh opened 6 years ago

erezsh commented 6 years ago

It's possible to improve Lark's speed, for all algorithms, by performing grammar simplification. Specifically, it's possible to merge adjacent anonymous terminals together, to reduce the parser work load, without affecting the resulting tree (An example of this technique can be seen in https://github.com/lark-parser/lark/issues/218)

In the same vein, ignored terminals can be joined to anonymous ones, with a similar effect.

MegaIng commented 5 years ago

I don't thing this is (always) true. There are two different reasons for this: !rule and %ignore:

start: "a" "b"
%ignore " "

will not match the same things as

start: "ab"
%ignore " "

The same goes for !rule:

 !start: "a" "b"

will not return the same tree as:

 !start: "ab"
erezsh commented 5 years ago

Well, for !rule, the optimization could be performed, and then reversed in post-processing by parsing the joined string. But that's probably not going to help performance.

As for %ignore, that's pretty simple:

start: "a" "b"
%ignore " "

Becomes

start: _JOINED
_JOINED: "a" " "* "b"
%ignore " "
MegaIng commented 5 years ago

On the same note, regarding %ignore and terminals, should this behave like this:

 start: TEST
 TEST: "a" "b"
  %ignore " "

will not match "a b". I know why, but this is a huge difference between terminals and rules. To correct this issue, you would have to already the post-processing splitter inplace. Probably not worth it.