mity / md4c

C Markdown parser. Fast. SAX-like interface. Compliant to CommonMark specification.
MIT License
787 stars 146 forks source link

Parsing ‘a***a***a***a***…’ takes quadratic time #63

Closed andersk closed 5 years ago

andersk commented 5 years ago
$ python -c 'print("a***" * 20000)' | time md2html > /dev/null
0.92user 0.00system 0:00.94elapsed 97%CPU (0avgtext+0avgdata 3080maxresident)k
0inputs+0outputs (0major+463minor)pagefaults 0swaps
$ python -c 'print("a***" * 40000)' | time md2html > /dev/null
3.83user 0.00system 0:03.87elapsed 98%CPU (0avgtext+0avgdata 4240maxresident)k
0inputs+0outputs (0major+775minor)pagefaults 0swaps
$ python -c 'print("a***" * 80000)' | time md2html > /dev/null
22.21user 0.02system 0:22.46elapsed 98%CPU (0avgtext+0avgdata 7028maxresident)k
0inputs+0outputs (0major+1441minor)pagefaults 0swaps
mity commented 5 years ago

@jgm Maybe you can thank me for https://github.com/commonmark/cmark/issues/284 by explaining how is it possible that Cmark is not vulnerable to this issue? ;-)

MD4C walks the dangling openers for each *** but all of them fail because of the rule of three. So after it all fails, it is added as just another dangling opener, and then we do it all again for next ***.

jgm commented 5 years ago

@mity in cmark we keep a list of "lower bounds" for openers that might match each possible type of closer (e.g., asterisk span with length mod 3 == 0). So once we've rejected one opening span as a possible opener for a closer consisting of 3 asterisks, we won't need to look at it again. Code is around here.

mity commented 5 years ago

@jgm I will have to do it a bit differently due to some internal limitations. But it helped a lot, thanks.

mity commented 5 years ago

@andersk Fixed it all, finally. Thank you for the bug hunt. I will eventually release 0.3.1 with all those fixes after some afl-fuzzing session. That is, unless you have something more queued for me? ;-)