google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0
5.16k stars 663 forks

Quadratic time and memory use on very deep trees #393

Closed DemiMarie closed 1 year ago

DemiMarie commented 7 years ago

The following program takes O(n^2) time (but not memory) for very large arguments. I suspect one needs #392 applied to notice (otherwise, one hits a stack overflow first).

#include <ctype.h>
#include <gumbo.h>
#include <stdint.h>
#include <limits.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char **argv) {
  if (argc != 2) {
    fputs("Must have exactly one argument: depth of tree to make\n", stderr);
    return 1;
  }
  char *endp;
  unsigned long long q = strtoull(argv[1], &endp, 0);
  const size_t maxarg = (SIZE_MAX - 9) / sizeof "<i>";
  if (q > maxarg) {
    fprintf(stderr, "Argument too large!  Must be at or below %zu\n", maxarg);
    exit(1);
  }
  if (*endp) {
    fprintf(stderr, "Trailing junk at end of argument!\n");
    exit(1);
  }
  char *buf = malloc(q * (2*sizeof "<i>" - 1) + sizeof "<i>\n");
  if (!buf) {
    fputs("Out of mem allocating buffer!\n", stderr);
    return 1;
  }
  char *bufptr = buf;
  memcpy(bufptr, "<i>\n", 4);
  bufptr += 4;
  for (size_t i = 0; i < q; ++i) {
    memcpy(bufptr, "<i>", 3);
    bufptr += 3;
  }
  for (size_t i = 0; i < q; ++i) {
    memcpy(bufptr, "</i>", 4);  /* copy all 4 bytes of the closing tag */
    bufptr += 4;
  }
  *bufptr = '\0';  /* terminate so strlen(buf) below is well-defined */
  GumboOptions options;
  memcpy(&options, &kGumboDefaultOptions, sizeof options);
  options.max_errors = 50;
  GumboOutput *output = gumbo_parse_with_options(&options, buf, strlen(buf));
  if (!output) {
    fputs("Out of mem while parsing!\n", stderr);
    free(buf);
    return 1;
  }
  gumbo_destroy_output(&kGumboDefaultOptions, output);
  free(buf);
  return 0;
}

perf reveals that the time is spent in reconstruct_active_formatting_elements.

DemiMarie commented 7 years ago

Running perf on a debug build reveals that the problem is that is_open_element is O(n) and called O(n) times.

kevinhendricks commented 7 years ago

Yes, is_open_element simply walks an ordered vector of open elements (effectively a stack of open element nodes) to see whether a specific GumboNode is present. Changing its O(n) behaviour would require a much more complex data structure than a gumbo_vector: one that either keeps an associated hash table of node pointers, or at the very least maintains a sorted variant in parallel to allow binary search. Either way, the memory footprint grows.

Google spent a large amount of time benchmarking this parser against actual code on the web (millions of pages), and very deep nesting is just not that common in real HTML. In other words, the number of open elements at any point during parsing is typically quite small, making a linear search over a handful of nodes in a vector the common case. I remember an early discussion of those performance-tuning stats in one of the pull requests or issues, but I can't remember which.

IMO, gumbo-parser should never allow a stack overrun on "crafted" pages with very deep nesting, but at the same time such pages should not drive the typical performance/design tradeoffs. Using a "smarter" data structure to search what is usually 3 to 10 elements would simply expand the memory footprint and code complexity without much actual gain in typical use. Since memory use is not the problem here, neither out-of-memory nor stack overrun (the security concerns) is an issue.

DemiMarie commented 7 years ago

I agree with your assessment. The problem is that Gumbo is used in server-side tools such as HTML sanitizers. For such tools, this is a denial of service security vulnerability.

One option would be to impose an arbitrary nesting limit, and fail the parse if the nesting level is too deep.


kevinhendricks commented 7 years ago

Yes, a nesting limit makes sense for your sanitizer use case. Do you then serialize the sanitized HTML page and feed it upstream, or do you work with the created gumbo node tree directly? If you serialize, then your non-recursive tree-traversal PR would be useful for serialization, provided it matches the normal recursive order used by most serializers.

If this is also something you are going to address with a PR, please consider making that limit controllable via the Gumbo parser options, working similarly to how max_errors works (-1 for unlimited nesting, otherwise a hard limit).

DemiMarie commented 7 years ago

On further thought, I don’t know if a nesting limit is a viable option here. The problem is this: suppose the nesting limit is exceeded. Now what? Most code that calls Gumbo (that I have observed) doesn’t check the return value. Returning NULL is thus not an option, except perhaps if Gumbo runs out of memory, since it will probably result in a segfault (fortunately not exploitable). So Gumbo must return something no matter what.

Why is is_open_element even a function? It seems to me that this should be a field in GumboElement.

kevinhendricks commented 7 years ago

The spec requires keeping a list of all open elements. A Gumbo element can only be on the list if it has a start tag and no end tag (or implied end tag) has occurred yet during parsing. This info is used by the routines that walk the list of open elements to find an appropriate parent when tag nesting order is mismatched, e.g. by the adoption agency algorithm (among others).

The parser does not have to abort and return NULL; it can simply stop adding to the tree once the nesting limit is reached, similar to how it can stop parsing once a set number of errors has occurred, and return the partial tree as it stands at the point the limit was hit.


kevinhendricks commented 7 years ago

See options->stop_on_first_error in the main parsing loop near the very end of parser.c, and how, when enabled, it stops early after a parsing error has been detected. Something similar could easily be done to test the current nesting depth and abort further parsing.

kevinhendricks commented 7 years ago

Also, is_open_element is a function of the parser state during parsing, not of a Gumbo node element's state (i.e. by the time parsing is complete, no elements remain on the open-elements list), and thus not something that should be stored inside an element itself.

kevinhendricks commented 7 years ago

That said ... I see now there already is a parse-flags field in a GumboNode. You could "extend/abuse" the parse-flags enum to add a private "currently in open_elements" flag bit, setting and clearing it as nodes are pushed onto and popped from open_elements, without changing the ABI. Then you could use it to implement your idea of replacing the O(n) is_open_element test with a quick check of a node's parse-flag bit.

kevinhendricks commented 7 years ago

@DemiMarie If running a sanitizer you might also want to look at PR 384 to detect and prevent overflow with long numeric entities: https://github.com/google/gumbo-parser/pull/384