google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0
5.16k stars 663 forks source link

Quadratic time and memory use with many unclosed tags #391

Closed DemiMarie closed 1 year ago

DemiMarie commented 7 years ago

While investigating ways to trigger #387, I found that if there are many consecutive unclosed tags followed by EOF, Gumbo will consume quadratic time and memory. The following C program demonstrates this.

#include <ctype.h>
#include <gumbo.h>
#include <stdint.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fputs("Must have exactly one argument: depth of tree to make\n", stderr);
    return 1;
  }
  char *endp;
  unsigned long long q = strtoull(argv[1], &endp, 0);
  const size_t maxarg = (SIZE_MAX - 2) / sizeof "<span";
  if (q > maxarg) {
    fprintf(stderr, "Argument too large!  Must be at or below %zu\n", maxarg);
    exit(1);
  }
  if (*endp) {
    fprintf(stderr, "Trailing junk at end of argument!\n");
    exit(1);
  }
  char *buf = malloc(q * (sizeof "<span>" - 1) + 2);
  if (!buf) {
    fputs("Out of mem allocating buffer!\n", stderr);
    return 1;
  }
  char *bufptr = buf;
  for (size_t i = 0; i < q; ++i) {
    memcpy(bufptr, "<span>", 6);
    bufptr += 6;
  }
  *bufptr++ = 'a';
  *bufptr = '\0';
  GumboOutput *output = gumbo_parse(buf);
  if (!output) {
    fputs("Out of mem while parsing!\n", stderr);
    free(buf);
    return 1;
  }
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  free(buf);
  return 0;
}

On my system, passing a mere 20000 to this program causes it to consume multiple gigs of memory:

$ /bin/time ./a.out 5000
0.13user 0.04system 0:00.18elapsed 98%CPU (0avgtext+0avgdata 198192maxresident)k
984inputs+0outputs (1major+49333minor)pagefaults 0swaps
$ /bin/time ./a.out 10000
0.57user 0.27system 0:00.84elapsed 99%CPU (0avgtext+0avgdata 986364maxresident)k
0inputs+0outputs (0major+258636minor)pagefaults 0swaps
$ /bin/time ./a.out 20000
2.65user 0.97system 0:03.68elapsed 98%CPU (0avgtext+0avgdata 3732612maxresident)k
0inputs+0outputs (0major+955997minor)pagefaults 0swaps
$ /bin/time ./a.out 40000
17.21user 3.61system 0:21.29elapsed 97%CPU (0avgtext+0avgdata 6589632maxresident)k
3744inputs+0outputs (53major+3160745minor)pagefaults 0swaps
$  /bin/time ./a.out 80000
23.56user 3.76system 0:27.84elapsed 98%CPU (0avgtext+0avgdata 6656248maxresident)k
28376inputs+0outputs (146major+3497943minor)pagefaults 0swaps
DemiMarie commented 7 years ago

html5ever does not have this problem, so this is specific to Gumbo and not to the HTML parsing algorithm.

$ python -c 'print "<span>" * 10000000 + "a"' | /bin/time ~/repos/rust/html5ever/target/release/examples/arena 
5.71user 0.39system 0:06.23elapsed 97%CPU (0avgtext+0avgdata 1486636maxresident)k
2304inputs+0outputs (5major+387234minor)pagefaults 0swaps
kevinhendricks commented 7 years ago

My guess this is related to using realloc to keep the required stack of open elements. I will try to profile this more to see what is going on.

kevinhendricks commented 7 years ago

performance sampling indicates the problem is parser_add_parse_error which is invoked each time a nested span is encountered. As part of a single error being added a copy of the entire open_elements list is created and recorded along with other information. This leads to quadratic performance as each level of span nesting recreates the entire nested list up to that point.

I am not sure if error reporting is part of the spec, but if not, we should be able to easily fix this by keeping only a limited number of tail elements of the open element list for each error. Or disable parse error reporting and recording by default.

kevinhendricks commented 7 years ago

Can confirm that changing parser.c kGumboDefaultOptions max_errors to 50 from -1 (unlimited) makes this issue go away completely. Please note a named your test program quadtime.c for the following:

with kGumboDefaultOptions max_errors = -1; (the current unlimited)

kbhend$ time ./quadtime 10000 real 0m0.982s user 0m0.706s sys 0m0.266s

kbhend$ time ./quadtime 20000 real 0m3.912s user 0m2.934s sys 0m0.975s

kbhend$ time ./quadtime 30000 real 0m9.228s user 0m7.048s sys 0m2.178s

kbhend$ time ./quadtime 40000 real 0m18.833s user 0m14.418s sys 0m4.405s

When max_errors in parser.c is changed to be a more reasonable 50 first errors, the impact on your quadtime.c is huge:

with kGumboDefaultOptions max_errors = 50:

kbhend$ time ./quadtime 10000 real 0m0.022s user 0m0.010s sys 0m0.003s

kbhend$ time ./quadtime 20000 real 0m0.027s user 0m0.021s sys 0m0.004s

kbhend$ time ./quadtime 30000 real 0m0.039s user 0m0.031s sys 0m0.006s

kbhend$ time ./quadtime 40000 real 0m0.053s user 0m0.043s sys 0m0.008s

So to prevent this issue for default users, a kGumboDefaultOptions max_errors field (see parser.c) should be changed to something more reasonable to prevent performance degradation and high memory usage.