sirthias / pegdown

A pure-Java Markdown processor based on a parboiled PEG parser supporting a number of extensions
http://pegdown.org
Apache License 2.0
1.29k stars 218 forks source link

Packrat Parsrers resolve exponential parsing time #248

Open kmizu opened 7 years ago

kmizu commented 7 years ago

Hi, I studied Packrat Parsing/PEG a bit. As I saw exponential parsing time, I though that it maybe resolved by packrat parsing.

To emulate packrat parsing, I added @MemoMismatches to all Rule returning methods that take no argument.

As a result, the exponential parsing time problems such as #43 and #104 were resolved in my machine. Actually, it maybe OK if the number of @MemoMismatches methods decreases.

kmizu commented 7 years ago

I'm sorry. Some tests failed by adding too much @MemoMismatches. I'll fix the problem. However, it is effective to reduce parsing time.

vsch commented 7 years ago

@kmizu, it may be effective in some cases but not in most. The parsing time issue and exponential parsing time is rooted in the PEG architecture not in the implementation of it. There are too many possibilities when parsing markdown to be able to prune them early so the possible parsing rules keep growing and with it the parsing time.

No matter what you do you will not be able to eliminate all the pathological input cases, because you have to identify them first and that is a career decision all by itself. For example I found that a few dozen repeated [ is enough to take a coffee break before the parser is done parsing. On the other hand commonmark-java and my project flexmark-java which started as a fork of commonmark-java, which use the CommonMark blocks first inlines second parsing method for markdown, use a 100,000 x [ as a pathological performance test, that completes in under 100ms.

You can mitigate some parsing issues by adding constructs and optimizations on top of PEG but the inherent shortcomings of using any regular grammar for Markdown, not just PEG, cannot be resolved in such fashion. Markdown is not a regular language.

A tool has to be applied to tasks it is suited for. PEG is not a suitable tool for markdown parsing and pegdown has pushed the envelope what could be done with PEG for markdown parsing.

Pegdown was intimately integrated into my plugin project and when I had to make the hard decision to switch parsers I spent some time looking for alternatives. I chose commonmark-java for its amazing speed, ease of maintenance, ease of tweaking and debugging of the parsing rules. Its rules are just java code, easily debugged unlike PEG.

commonmark-java was not at all suited to replace pegdown but I made the decision based on its architecture and implementation. I already had enough experience fixing and adding to pegdown to know what to avoid in the next parser. In the end it was well worth it. Average file parsing performance is 30x faster than pegdown. No pathological cases, no timeouts, easy to tweak parsing rules per element since they have minimal interdependencies, unlike one big PEG grammar.