taocpp / PEGTL

Parsing Expression Grammar Template Library
Boost Software License 1.0
1.95k stars 229 forks source link

parse_tree needs to be optimized #321

Closed w4454962 closed 1 year ago

w4454962 commented 2 years ago

I wrote a script language parser half a year ago. At that time, it took 12 seconds to parse 300000 lines of 27mb code. The efficiency was too low. I tried to optimize parse_tree yesterday, the efficiency is increased by 200%, and it only takes 4 seconds to complete the parsing. Although the efficiency cannot meet my requirements, this optimization scheme should be told to you

parse_tree2.zip

w4454962 commented 2 years ago

It takes 4 seconds for pegtl to parse 300000 lines of 27mb jass syntax tree

https://github.com/w4454962/jass-parser

luajit + lpeg parsing 300000 lines of 27mb jass syntax tree takes 2 seconds

https://github.com/w4454962/jass-parser-luajit

It only takes 0.7 seconds to parse the same content using the same rules of bison/yacc

https://github.com/lep/pjass

w4454962 commented 2 years ago

Is there a way to implement a pegtl version that is close to the running efficiency of the bison / yacc scheme?

d-frey commented 2 years ago

It looks like you just created a static stack to avoid allocations. Have you tried to use a cache like in our example at https://github.com/taocpp/PEGTL/blob/3fbf6721e8d90fb737f48f6b1c88aa782e2cd4b7/src/example/pegtl/parse_tree.cpp#L99-L160

As for the comparison to luajit + lpeg or even to bison/yacc: I don't think we will ever be that fast with our parse tree. And I don't have time to look into this more, sorry.

w4454962 commented 2 years ago

OK, I really ignored the static cache code in the demo. It already exists and is the same as my idea. Thank you for your reply. If the speed can't be faster, we can only give up pegtl

ColinH commented 2 years ago

The PEGTL is highly modular and there are many things that can be done to improve parsing performance.

However the parse_tree included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.

One important thing to keep an eye on if you care about performance that is independent of the data structure you generate while parsing is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (seeing as backtracking always implies wasted work).

For example you have several rules that are implemented as sor< one< A, B, ... > > instead of the equivalent one< A, B, ... >. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.

Similarly you frequently use star< seq< A, B, ... > > instead of the equivalent, and shorter, star< A, B, ... > (star has an implicit seq), unless of course you really need the seq to attach actions to, or for parse_tree nodes. There are other examples of this kind of potentially superfluous rules in your grammar.

Then there is the way you parse expressions, where the space rule before the operators will typically be parsed multiple times before the right one, as determined by the operator after the space, was found.

And another thing to look at is whether the sub-rules of an sor are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).

Then there is the possibly best chance of improving performance, bypassing the generic parse_tree and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.

w4454962 commented 2 years ago

The PEGTL is highly modular and there are many things that can be done to improve parsing performance.

However the parse_tree included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.

One important thing to keep an eye on if you care about performance that is independent of the data structure you generate while parsing is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (seeing as backtracking always implies wasted work).

For example you have several rules that are implemented as sor< one< A, B, ... > > instead of the equivalent one< A, B, ... >. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.

Similarly you frequently use star< seq< A, B, ... > > instead of the equivalent, and shorter, star< A, B, ... > (star has an implicit seq), unless of course you really need the seq to attach actions to, or for parse_tree nodes. There are other examples of this kind of potentially superfluous rules in your grammar.

Then there is the way you parse expressions, where the space rule before the operators will typically be parsed multiple times before the right one, as determined by the operator after the space, was found.

And another thing to look at is whether the sub-rules of an sor are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).

Then there is the possibly best chance of improving performance, bypassing the generic parse_tree and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.

I have greatly optimized the upper logic code, but the speed has not been improved by an order of magnitude. It is only accelerated from 4.2s to 4.0s. It may be more necessary to optimize pegtl itself or more deeply optimize parse_ tree

w4454962 commented 2 years ago

And if you use the cache method demonstrated by pegtl, it will be 1s slower than my cache code, I don't know how to design a more efficient "parse_tree", which seems to have reached the bottleneck stage.

ColinH commented 2 years ago

The idea is not to make the parse tree as fast as possible, but to create a more specific and optimised data structure while parsing, one that minimises memory allocations and wasted work...