Closed DemiMarie closed 1 year ago
html5ever does not have this problem, so this is specific to Gumbo and not to the HTML parsing algorithm.
$ python -c 'print "<span>" * 10000000 + "a"' | /bin/time ~/repos/rust/html5ever/target/release/examples/arena
5.71user 0.39system 0:06.23elapsed 97%CPU (0avgtext+0avgdata 1486636maxresident)k
2304inputs+0outputs (5major+387234minor)pagefaults 0swaps
My guess is that this is related to using realloc to maintain the stack of open elements. I will profile this further to see what is going on.
Performance sampling indicates that the problem is parser_add_parse_error, which is invoked each time a nested span is encountered. Whenever a single error is added, a copy of the entire open_elements list is created and recorded along with other information. This leads to quadratic performance, because each level of span nesting copies the entire nested list up to that point.
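To make the cost concrete, here is a rough model of the behavior described above. It is not Gumbo's code: `model_copied_entries` is a hypothetical helper, and the assumption that every nested element triggers exactly one full snapshot of the stack is taken from the profiling description, not from reading parser.c.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Simplified model (an assumption, not Gumbo's implementation): parse `depth`
 * nested unclosed elements, and on each one record an "error" that snapshots
 * the whole open-elements stack. Returns the number of stack entries copied. */
size_t model_copied_entries(size_t depth) {
    int *open_elements = malloc((depth ? depth : 1) * sizeof *open_elements);
    size_t copied = 0;
    assert(open_elements != NULL);

    for (size_t i = 0; i < depth; i++) {
        open_elements[i] = (int)i;   /* push one more unclosed element */
        size_t len = i + 1;          /* current stack depth */

        /* Snapshot the entire stack for this error. */
        int *snapshot = malloc(len * sizeof *snapshot);
        assert(snapshot != NULL);
        memcpy(snapshot, open_elements, len * sizeof *snapshot);
        copied += len;

        /* The model frees the snapshot at once; a parser that keeps every
         * snapshot alive would also see its memory grow quadratically. */
        free(snapshot);
    }
    free(open_elements);
    return copied;
}
```

For a depth of n the model copies 1 + 2 + ... + n = n(n+1)/2 entries, so doubling the nesting depth roughly quadruples the work.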
I am not sure whether error reporting is part of the spec, but if not, we should be able to fix this easily by keeping only a limited number of tail elements of the open elements list for each error, or by disabling parse error reporting and recording by default.
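A minimal sketch of the "limited tail" idea, under the same simplified model of one error per nested element. The names `snapshot_tail`, `model_tail_copied_entries`, and `tail_limit` are hypothetical and do not exist in Gumbo; the stack is modeled as plain ints.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Copies at most `tail_limit` entries from the top (end) of the stack.
 * Returns a malloc'd array owned by the caller; its length goes to *out_len. */
int *snapshot_tail(const int *stack, size_t stack_len, size_t tail_limit,
                   size_t *out_len) {
    size_t n = stack_len < tail_limit ? stack_len : tail_limit;
    int *snap = malloc((n ? n : 1) * sizeof *snap);
    assert(snap != NULL);
    if (n > 0)
        memcpy(snap, stack + (stack_len - n), n * sizeof *snap);
    *out_len = n;
    return snap;
}

/* Same nesting model as before, but each error keeps only the stack tail.
 * Returns the total number of stack entries copied. */
size_t model_tail_copied_entries(size_t depth, size_t tail_limit) {
    int *open_elements = malloc((depth ? depth : 1) * sizeof *open_elements);
    size_t copied = 0;
    assert(open_elements != NULL);

    for (size_t i = 0; i < depth; i++) {
        size_t len = 0;
        open_elements[i] = (int)i;   /* push one more unclosed element */
        int *snap = snapshot_tail(open_elements, i + 1, tail_limit, &len);
        copied += len;
        free(snap);   /* a real parser would store this with the error */
    }
    free(open_elements);
    return copied;
}
```

With a fixed `tail_limit`, each error copies a bounded number of entries, so the total work grows linearly with nesting depth instead of quadratically.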
I can confirm that changing max_errors in kGumboDefaultOptions (parser.c) from -1 (unlimited) to 50 makes this issue go away completely. Please note that I named your test program quadtime.c for the following:
With kGumboDefaultOptions max_errors = -1 (the current default, unlimited):
kbhend$ time ./quadtime 10000 real 0m0.982s user 0m0.706s sys 0m0.266s
kbhend$ time ./quadtime 20000 real 0m3.912s user 0m2.934s sys 0m0.975s
kbhend$ time ./quadtime 30000 real 0m9.228s user 0m7.048s sys 0m2.178s
kbhend$ time ./quadtime 40000 real 0m18.833s user 0m14.418s sys 0m4.405s
When max_errors in parser.c is changed to a more reasonable limit of the first 50 errors, the impact on your quadtime.c is huge:
With kGumboDefaultOptions max_errors = 50:
kbhend$ time ./quadtime 10000 real 0m0.022s user 0m0.010s sys 0m0.003s
kbhend$ time ./quadtime 20000 real 0m0.027s user 0m0.021s sys 0m0.004s
kbhend$ time ./quadtime 30000 real 0m0.039s user 0m0.031s sys 0m0.006s
kbhend$ time ./quadtime 40000 real 0m0.053s user 0m0.043s sys 0m0.008s
So to prevent this issue for users who rely on the defaults, the max_errors field of kGumboDefaultOptions (see parser.c) should be changed to something more reasonable, which avoids both the performance degradation and the high memory usage.
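Until the default changes, callers can apply the same cap per parse without patching parser.c. The sketch below uses Gumbo's public `gumbo_parse_with_options` and the `max_errors` field of `GumboOptions`; the wrapper name `parse_with_error_cap` is hypothetical, and the value 50 is simply the one tested above, not a recommended constant.

```c
#include <string.h>
#include <gumbo.h>

/* Parses `html` while recording at most `error_cap` parse errors.
 * Passing -1 keeps the library's unlimited behavior. */
GumboOutput *parse_with_error_cap(const char *html, int error_cap) {
    GumboOptions options = kGumboDefaultOptions;  /* copy the library defaults */
    options.max_errors = error_cap;
    return gumbo_parse_with_options(&options, html, strlen(html));
}
```

A caller would use `GumboOutput *out = parse_with_error_cap(html, 50);` and later release it with `gumbo_destroy_output(&kGumboDefaultOptions, out);`, since the wrapper leaves the default allocator settings unchanged.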
While investigating ways to trigger #387, I found that if there are many consecutive unclosed tags followed by EOF, Gumbo will consume quadratic time and memory. The following C program demonstrates this.
On my system, passing a mere 20000 to this program causes it to consume multiple gigs of memory: