Closed w4454962 closed 1 year ago
It takes 4 seconds for pegtl to parse 300000 lines of 27mb jass syntax tree
https://github.com/w4454962/jass-parser
luajit + lpeg parsing 300000 lines of 27mb jass syntax tree takes 2 seconds
https://github.com/w4454962/jass-parser-luajit
It only takes 0.7 seconds to parse the same content using the same rules of bison/yacc
Is there a way to implement a pegtl version that is close to the running efficiency of the bison / yacc scheme?
It looks like you just created a static stack to avoid allocations. Have you tried to use a cache like in our example at https://github.com/taocpp/PEGTL/blob/3fbf6721e8d90fb737f48f6b1c88aa782e2cd4b7/src/example/pegtl/parse_tree.cpp#L99-L160
As for the comparison to luajit + lpeg or even to bison/yacc: I don't think we will ever be that fast with our parse tree. And I don't have time to look into this more, sorry.
OK, I really ignored the static cache code in the demo. It already exists and is the same as my idea. Thank you for your reply. If the speed can't be faster, we can only give up pegtl
The PEGTL is highly modular and there are many things that can be done to improve parsing performance.
However the parse_tree
included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.
One important thing to keep an eye on if you care about performance that is independent of the data structure you generate while parsing is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (seeing as backtracking always implies wasted work).
For example you have several rules that are implemented as sor< one< A, B, ... > >
instead of the equivalent one< A, B, ... >
. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.
Similarly you frequently use star< seq< A, B, ... > >
instead of the equivalent, and shorter, star< A, B, ... >
(star
has an implicit seq
), unless of course you really need the seq
to attach actions to, or for parse_tree
nodes. There are other examples of this kind of potentially superfluous rules in your grammar.
Then there is the way you parse expressions, where the space
rule before the operators will typically be parsed multiple times before the right one, as determined by the operator after the space
, was found.
And another thing to look at is whether the sub-rules of an sor
are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).
Then there is the possibly best chance of improving performance, bypassing the generic parse_tree
and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.
The PEGTL is highly modular and there are many things that can be done to improve parsing performance.
However the
parse_tree
included with the PEGTL is more of an educational example that shows how to parse into a certain kind of data structure built around standard (template) library facilities, and less of a highly optimised way to build a tree when speed is important.One important thing to keep an eye on if you care about performance that is independent of the data structure you generate while parsing is to keep the grammar as simple as possible, and in particular to tune it for minimal backtracking (seeing as backtracking always implies wasted work).
For example you have several rules that are implemented as
sor< one< A, B, ... > >
instead of the equivalentone< A, B, ... >
. Now this particular case is handled efficiently by the PEGTL and does not produce a run-time overhead, however every superfluous sub-rule in the grammar makes it more complicated to parse (no pun intended) and reason about as a human.Similarly you frequently use
star< seq< A, B, ... > >
instead of the equivalent, and shorter,star< A, B, ... >
(star
has an implicitseq
), unless of course you really need theseq
to attach actions to, or forparse_tree
nodes. There are other examples of this kind of potentially superfluous rules in your grammar.Then there is the way you parse expressions, where the
space
rule before the operators will typically be parsed multiple times before the right one, as determined by the operator after thespace
, was found.And another thing to look at is whether the sub-rules of an
sor
are, if possible, ordered from most likely to least likely (as they will be attempted to match in the order listed in the grammar).Then there is the possibly best chance of improving performance, bypassing the generic
parse_tree
and directly parsing into a more "strongly typed" data structure that minimises heap allocations. I can't say more about this without spending significantly more time with your code.
I have greatly optimized the upper logic code, but the speed has not been improved by an order of magnitude. It is only accelerated from 4.2s to 4.0s. It may be more necessary to optimize pegtl itself or more deeply optimize parse_ tree
And if you use the cache method demonstrated by pegtl, it will be 1s slower than my cache code, I don't know how to design a more efficient "parse_tree", which seems to have reached the bottleneck stage.
The idea is not to make the parse tree as fast as possible, but to create a more specific and optimised data structure while parsing, one that minimises memory allocations and wasted work...
I wrote a script language parser half a year ago. At that time, it took 12 seconds to parse 300000 lines of 27mb code. The efficiency was too low. I tried to optimize parse_tree yesterday, the efficiency is increased by 200%, and it only takes 4 seconds to complete the parsing. Although the efficiency cannot meet my requirements, this optimization scheme should be told to you
parse_tree2.zip