google / gumbo-parser

An HTML5 parsing library in pure C99
Apache License 2.0

Proposal: Remove the custom memory allocators and replace with an arena #312

Closed nostrademons closed 1 year ago

nostrademons commented 9 years ago

I've got a partial patch in #309 to replace Gumbo's custom memory allocators with an arena stored in the GumboOutput structure. Seeking input on situations where this may have unintended consequences.

Concrete proposal

  1. Eliminate the allocator/deallocator/userdata fields from GumboOptions.
  2. Add an 'arena' member to GumboOutput. This is an opaque struct that is destroyed with gumbo_destroy_output, and contains all memory used by the library.
  3. Internally, the arena is a linked list of chunks, currently sized at 240K each to fit entirely within the L2 cache of recent Intel processors. Allocation is a pointer bump, with a new chunk allocated when there is no space left in the current one. Deallocation is a no-op; the full list of chunks is destroyed by gumbo_destroy_output.
  4. Optionally, we can now remove the GumboOptions argument to gumbo_destroy_output.

Current workaround

Currently, Gumbo allows the use of custom allocators (including arenas), but defaults to a thin wrapper around malloc. Implementing an arena on top of Gumbo is easy, but the design trade-offs in how Gumbo allocates memory differ depending on which allocator is used. As a result, users of the default allocator get reduced performance to support the possibility of an arena, while users of an arena get increased memory traffic to support the system-malloc default. It may be simpler for users to remove the choice and instead provide good performance out-of-the-box with an arena.

Benefits

  1. Measured performance increase of roughly 20% in CPU time, out of the box.
  2. Smaller API surface for users to understand.
  3. No possibility of memory leaks, and a much reduced possibility of dangling pointers.
  4. Allows us to add proper out-of-memory handling, something I've long wanted. Right now, if malloc fails in gumbo_parser_allocate, the parser dereferences a null pointer (or worse, tramples memory). This is because (a) malloc never fails in modern Linux distributions, and (b) cleaning up the parser in a way that doesn't leak memory is highly non-trivial, since the parse tree may have been in an inconsistent state when the bad allocation was made. With patch #309, when an arena allocation fails it longjmps back to gumbo_parse, which sets an out_of_memory flag in GumboOutput and returns the parse tree built up to that point; the partial parse tree is then freed when the arena is deallocated in gumbo_destroy_output.
  5. Opens up the possibility of some API changes that were previously deemed too complicated for memory-management reasons. For example, gumbo_normalized_tagname could return a buffer allocated inside the arena that actually normalizes the case (it doesn't right now, because that would require a fresh buffer that must be deallocated), without the client having to free it. gumbo_destroy_output would no longer need to carry the GumboOptions along.

Drawbacks

  1. Larger memory usage. A typical HTML document under Gumbo 0.10.0 takes roughly 5K of memory per 1K of document length to parse. The arena implementation roughly doubles that. In absolute numbers, under the arena the median document takes about 720K of memory, with the 95th percentile at 2.4M.
  2. Makes mutability much more difficult. Arenas are a poor choice for mutable documents, where allocation and freeing are repeated. This proposal is incompatible with #311, and would largely lock Gumbo into a use-case where you either query the parse tree and pull out the information you want, or you wrap it in a data structure of your choosing and throw it away.
  3. Backwards incompatible; it modifies the GumboOptions structure.
  4. Loses the ability to use custom allocators. The main use-case for them in my own programming had always been arenas, but I wonder if anyone is using them to e.g. place Gumbo data on a garbage-collected heap? Is anyone using them at all?
  5. How would this affect embedded systems, where RAM may be tighter? Is anyone using Gumbo in an embedded system?

Compromise solutions

  1. We could allow tweaking of the arena chunk size as part of GumboOptions, hopefully easing memory pressure when parsing small HTML documents on an embedded system.
  2. We could keep the custom allocator machinery, but change the default to an arena allocator and special-case some codepaths for it. This would keep most of the performance benefits, flexibility, and backwards compatibility, but we would lose advantages 2-5, increase the internal complexity significantly, and still face performance trade-offs between arena and system-malloc optimizations.

Comment with a +1 or -1, or any additional comments or considerations.

sylvinus commented 9 years ago

I'm largely in favor of trading -20% CPU for +100% memory, but maybe a more rational argument can be made with some benchmarks on standard hardware where Gumbo would realistically be used instead of other alternatives? (I don't think embedded systems fall in this category, but I might be wrong)

How about taking a regular Mac/PC and a few cloud instance types from EC2/GCE and seeing what becomes the bottleneck first?

nostrademons commented 9 years ago

Hard to do this without knowing what people are using Gumbo for. I'd initially designed it thinking it'd mostly be used for command-line tools to refactor or lint HTML, but I've since heard of people using it in GUI e-book readers, high-volume server-side webapps, scripting languages, and more. I've used it personally for some big-data passes over a corpus of web pages, and was planning on using it to manipulate WebComponents server-side but ended up scrapping that project. In some of those use-cases memory is more precious than in others: e.g. if you're using it in a server with 100 requests in flight, then 4M/doc works out to an additional 400M. If you're just writing a command-line tool on a desktop, 4M/doc is negligible.

So, in the spirit of gathering data - what are people using Gumbo for, if they don't mind sharing? Server, desktop, command-line, mobile, embedded or other app? Memory-constrained or CPU-constrained? I'm a bit busy right now and probably won't be able to make any changes for a while, but it'll be useful to have the data if it comes time to make trade-offs in the future.

rgrove commented 9 years ago

I use Gumbo in Sanitize, a Ruby library for sanitizing untrusted HTML based on a whitelist. Sanitize is used by many people, but as one example, in my day job at SmugMug, it's used to sanitize HTML fragments that our customers enter into image captions or use to customize their photo sites.

I don't have exact numbers to share, but SmugMug feeds a fairly high volume of small HTML fragments through Sanitize and Gumbo on demand and displays the output to users. In general I'd be willing to trade memory for faster parsing, but I'm happy with Gumbo's current performance for our particular use case.

sylvinus commented 9 years ago

Got it. An interesting question would then be, how long do the pages need to stay in memory compared to the time spent parsing them?

My use case is general web crawling from cloud instances, so we parse as much as we can in parallel, do some analysis on their content and discard them as fast as possible. If 100 concurrent pages can saturate multiple CPUs with 400M of RAM (even the smallest compute-optimized EC2 instance has 3.75GB RAM), that's quite a good ratio, but indeed it remains to be tested.