rubys / nokogumbo

A Nokogiri interface to the Gumbo HTML5 parser.
Apache License 2.0

Limit number of attributes per element to prevent DoS #143

Closed rafbm closed 4 years ago

rafbm commented 4 years ago

Prior to this commit, the time taken to parse an element with an unreasonable number of attributes grows non-linearly and can lead to excessive memory usage, a hanging process, and crashes.

As a consumer of unsafe HTML, you must enforce some limits on payloads prior to parsing. One easy example is truncating the input string to some maximum length. However, in the case of an extreme number of attributes, there’s no practical way to protect yourself while preserving decent parsed HTML output (the HTML itself isn’t that long). Thus, I believe Nokogumbo should internally enforce a limit beyond which attributes of a given element are ignored.

Here’s a script to illustrate the current non-linear work time:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'
  gem 'nokogumbo'
end

require 'nokogumbo'
require 'benchmark'

Benchmark.bm do |x|
  [
    100,
    1_000,
    5_000,
    10_000,
    25_000,
    50_000,
    75_000,
    # 100_000,
  ].each do |attribute_count|
    # Build a <div> carrying `attribute_count` distinct boolean attributes.
    html = <<~HTML
      <div
        #{attribute_count.times.map { |i| "fake-attr-#{i}" }.join("\n")}
      >
    HTML

    x.report "#{attribute_count.to_s.rjust(7)} attributes" do
      Nokogiri::HTML5(html)
    end
  end
end

       user     system      total        real
    100 attributes  0.000251   0.000024   0.000275 (  0.000273)
   1000 attributes  0.005800   0.000065   0.005865 (  0.005875)
   5000 attributes  0.218430   0.003866   0.222296 (  0.308177)
  10000 attributes  1.066644   0.009984   1.076628 (  1.151923)
  25000 attributes  4.366067   0.010645   4.376712 (  4.380356)
  50000 attributes 18.116970   0.029383  18.146353 ( 18.157078)
  75000 attributes 44.749722   0.067786  44.817508 ( 44.843119)

With the proposed fix, that script becomes:

       user     system      total        real
    100 attributes  0.000249   0.000014   0.000263 (  0.000262)
   1000 attributes  0.001328   0.000001   0.001329 (  0.001328)
   5000 attributes  0.002834   0.000002   0.002836 (  0.002836)
  10000 attributes  0.004665   0.000009   0.004674 (  0.004673)
  25000 attributes  0.010994   0.000026   0.011020 (  0.010997)
  50000 attributes  0.021041   0.000007   0.021048 (  0.021048)
  75000 attributes  0.031382   0.000110   0.031492 (  0.031570)

To determine what the value of Nokogumbo::DEFAULT_MAX_ATTRIBUTES should be, I looked at lists of attributes to get an idea of how many you could technically use on a given element:

https://github.com/wooorm/html-element-attributes/blob/master/index.json
https://github.com/wooorm/svg-element-attributes/blob/master/index.json
https://github.com/wooorm/aria-attributes/blob/master/index.json

Max number of HTML attributes per element: 58
Max number of SVG attributes per element:  107
Max number of ARIA attributes per element: 49

(script used)

I believe all of these can be combined on the same SVG element. This obviously doesn’t account for data-* attributes, but those are unlimited in theory, so the value we pick will be arbitrary no matter what. I ended up using 400, simply because we already have Nokogumbo::DEFAULT_MAX_TREE_DEPTH = 400 and the same number seemed to make sense.

I assumed we’d want an argument to customize that limit. I went with :max_attributes; :max_attributes_per_element would be clearer, but I don’t think the added length is necessary or desirable.
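
For illustration, here’s a minimal sketch of passing the proposed option (the value is arbitrary and html stands for whatever string is being parsed):

require 'nokogumbo'

# :max_attributes caps the number of attributes parsed per element;
# it defaults to Nokogumbo::DEFAULT_MAX_ATTRIBUTES (400 in this proposal).
doc = Nokogiri::HTML5(html, max_attributes: 1_000)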

stevecheckoway commented 4 years ago

Hi,

Thanks for this PR.

Like the max tree depth option, I think this is a fine thing to add. Looking at the code, checking for duplicate attribute names is clearly O(n^2) in the number of names. (This could be made amortized constant time using a hash set. It may be worth measuring how much of a slowdown that is for the normal case at some point in the future. I'll make a note about it.)
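
The duplicate check itself lives in the C parser, but the complexity difference is easy to see with a quick Ruby sketch (purely illustrative; not the parser's code):

require 'set'
require 'benchmark'

names = 10_000.times.map { |i| "fake-attr-#{i}" }

Benchmark.bm(6) do |x|
  # Linear scan over the attributes seen so far: O(n) per name, O(n^2) overall.
  x.report('array') do
    seen = []
    names.each { |name| seen << name unless seen.include?(name) }
  end

  # Hash set membership test: amortized O(1) per name, O(n) overall.
  x.report('set') do
    seen = Set.new
    names.each { |name| seen.add(name) unless seen.include?(name) }
  end
end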

I think hitting this limit should be reported to the user in some fashion, as silently dropping attributes seems bad. When the tree depth limit is hit, we raise an ArgumentError. Should we do the same thing here, or something else?

rafbm commented 4 years ago

Thanks for the follow-up. Now that you mention it, it’s true that silently ignoring attributes vs. raising for an exceeded tree depth isn’t consistent.

In our application code base, we actually rescue the ArgumentError raised by Nokogumbo when the max tree depth is hit, and in the rescue block we fall back to parsing with plain Nokogiri, whose non-customizable ~256 max tree depth behaves more like my :max_attributes here: it silently discards any deeper content. Ultimately, as an email client we want to keep and display as much decent HTML as we can for our users, so for any edge case like this we prefer a silent limiting behavior rather than bailing out and having to display “This email isn’t rendered because it is too extreme in some way.” I understand other applications might have different expectations.
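
A minimal sketch of that fallback, assuming an email_html variable for the payload and the ~400 default tree depth mentioned elsewhere in this thread:

begin
  # Nokogumbo raises ArgumentError when :max_tree_depth is exceeded.
  doc = Nokogiri::HTML5(email_html, max_tree_depth: 400)
rescue ArgumentError
  # Plain Nokogiri doesn’t raise; it silently discards anything nested
  # deeper than its own non-configurable (~256) limit.
  doc = Nokogiri::HTML(email_html)
end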

To be consistent, I can tweak this new code to raise ArgumentError instead of ignoring further attributes. In our app we could rescue it and parse with plain Nokogiri, as for the tree depth case. Some custom subclasses of ArgumentError would be nice, such as MaxTreeDepthExceeded and MaxAttributesExceeded.

That said, I’m curious to see whether the O(n^2) issue you noticed will make my PR here irrelevant. Let me know how you see it.

rafbm commented 4 years ago

I just tested the script with plain Nokogiri and it suffers the same issue, even worse:

       user     system      total        real
    100 attributes  0.000267   0.000033   0.000300 (  0.000298)
   1000 attributes  0.011261   0.000078   0.011339 (  0.011340)
   5000 attributes  0.297772   0.000709   0.298481 (  0.298574)
  10000 attributes  1.195172   0.002421   1.197593 (  1.198453)
  25000 attributes  7.266606   0.013161   7.279767 (  7.284907)
  50000 attributes 29.401483   0.043176  29.444659 ( 29.468001)
  75000 attributes 65.669765   0.094587  65.764352 ( 65.853030)

So falling back to Nokogiri is not an option here. In our app’s case it would be really ideal to silently ignore attributes past X. I wonder what a proper design would be. Some limits_exceeded: :silence / :raise argument that would alter the behavior of both :max_tree_depth and :max_attributes?

Perhaps if we just heavily increase DEFAULT_MAX_ATTRIBUTES to something like 10000, then silencing is OK for everyone and we don’t actually need the :max_attributes argument?

Here’s what I get with max_attributes: 10_000, quite reasonable:

       user     system      total        real
    100 attributes  0.000252   0.000019   0.000271 (  0.000270)
   1000 attributes  0.005898   0.000074   0.005972 (  0.005972)
   5000 attributes  0.183733   0.001021   0.184754 (  0.184980)
  10000 attributes  0.758465   0.002561   0.761026 (  0.761851)
  25000 attributes  0.778167   0.002476   0.780643 (  0.781487)
  50000 attributes  0.805981   0.002015   0.807996 (  0.808754)
  75000 attributes  0.801902   0.001647   0.803549 (  0.804507)

stevecheckoway commented 4 years ago

Having to fall back to Nokogiri seems like a Nokogumbo design flaw.

I can't recall what the maximum tree depth was designed to prevent from happening. Maybe there's some other superlinear behavior I'm not remembering.

Without committing to any particular API, I could envision all of these behaviors being reasonable and desirable in some situation.

  1. When the maximum {tree depth,attribute count} is hit, abort the parse (raise an exception);
  2. When the maximum {tree depth,attribute count} is hit, silently discard {deeper elements,additional attributes};
  3. When the maximum {tree depth,attribute count} is hit, discard {deeper elements,additional attributes} but signal this fact to the user; and
  4. When the maximum {tree depth,attribute count} is hit, add a (configurable?) marker of some kind.

I think any of them other than 1 needs to be opt-in. And depending on the signaling mechanism, 2 and 3 could be the same thing; e.g., after opting in, the user would have to check some status.

Would any of those options work for you?

Just as an aside since computer security is my focus, I can't help but think of an attack where a maximum attribute count could be abused to show different content to your users vs. other software without such a limit. If the attribute limit is n, then

<div data-1 data-2 ... data-n hidden>
  Conditional content
</div>

The conditional content would be displayed with an attribute limit but would not be displayed without it.

rafbm commented 4 years ago

Would any of those options work for you?

Of course. Everything you said makes sense. Option 4 may not be crucial though. And after some thinking, option 1 alone would be good enough in our app’s context.

I will change the current PR to raise ArgumentError instead of discarding, like the tree depth case. I believe that would be a good first step to merge on its own; we could shelve other options for another PR.

stevecheckoway commented 4 years ago

TL;DR The patch seems reasonable to me (although I haven't tested). @rubys, thoughts?

I was so sure that the O(n^2) behavior around checking for duplicate attributes was the problem that I decided to fix the issue. I realized that if you're concerned about malicious input, a hash set is unlikely to be helpful because it's quite tricky to produce one with good guaranteed worst-case behavior. So I implemented a simple red-black tree to use as an attribute cache while parsing and ran your script, expecting significantly better behavior.

user     system      total        real
  100 attributes  0.000533   0.000078   0.000611 (  0.000595)
 1000 attributes  0.003087   0.000227   0.003314 (  0.003351)
 5000 attributes  0.082747   0.002031   0.084778 (  0.085699)
10000 attributes  0.348805   0.006017   0.354822 (  0.358073)
25000 attributes  2.244946   0.047076   2.292022 (  2.329323)
50000 attributes 13.702191   0.253913  13.956104 ( 14.201136)
75000 attributes 47.923286   0.795756  48.719042 ( 49.464806)

So obviously that's not the bottleneck. I ran with ruby-prof (actually, the above numbers come from the ruby-prof run) and, unsurprisingly, Nokogumbo#parse is where all the time is taken.

%self      total      self      wait     child     calls  name                           location
99.66     66.442    66.442     0.000     0.000        7   <Module::Nokogumbo>#parse

But that's not helpful because ruby-prof can't profile the C code. So I ran it in Instruments (it's annoyingly hard to configure that correctly with bundler) and identified the culprit.

Time      Weight        Self Weight             Symbol Name
1.08 min  100.0%    1.08 min        xmlNewPropInternal

The format is slightly different from ruby-prof's, but xmlNewPropInternal is taking 1.08 minutes and Nokogumbo#parse was taking 66.442 seconds (1.11 minutes). And all but less than 200 ms of that is spent in xmlNewPropInternal itself, not in any functions it calls.

So let's take a look at the code.

    /*
     * Add it at the end to preserve parsing order ...
     */
    if (node != NULL) {
        if (node->properties == NULL) {
            node->properties = cur;
        } else {
            xmlAttrPtr prev = node->properties;

            while (prev->next != NULL)
                prev = prev->next;
            prev->next = cur;
            cur->prev = prev;
        }
    }

And there it is. That while loop performs insertion at the end of a doubly linked list by traversing the entire list. In fact, that's pretty clear to see in the annotated assembly in the time profile.

+0x1f5      movq                %rax, %rcx
+0x1f8      movq                48(%rcx), %rax
+0x1fc      testq               %rax, %rax
+0x1ff      jne                 "xmlNewPropInternal+0x1f5"

The test instruction (performing the comparison against NULL) was the hottest instruction.
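
As a back-of-the-envelope check, appending the k-th attribute walks the k−1 properties already in the list, so 75,000 attributes cost on the order of 75,000² / 2 ≈ 2.8 billion node visits in this loop alone, which is consistent with the tens of seconds measured above.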

Finally, just to compare with the current implementation (without your patch), here are the top functions.

Timing script:

user     system      total        real
  100 attributes  0.000447   0.000092   0.000539 (  0.000531)
 1000 attributes  0.008460   0.000275   0.008735 (  0.008842)
 5000 attributes  0.231836   0.004479   0.236315 (  0.238988)
10000 attributes  0.937308   0.015195   0.952503 (  0.963512)
25000 attributes  5.386155   0.083429   5.469584 (  5.532977)
50000 attributes 28.609162   0.400282  29.009444 ( 29.281556)
75000 attributes 84.533603   1.211917  85.745520 ( 86.640742)

ruby-prof:

%self      total      self      wait     child     calls  name                           location
99.83    122.665   122.665     0.000     0.000        7   <Module::Nokogumbo>#parse

Top functions in the time profile.

Time      Weight   Self Weight            Symbol Name
57.01 s   46.8%    56.82 s        xmlNewPropInternal
1.04 min  51.4%    15.09 s        finish_attribute_name
27.95 s   22.9%    27.95 s          strlen
19.43 s   15.9%    19.43 s         _platform_memcmp

xmlNewPropInternal took 57.01 s. finish_attribute_name took 1.04 min, of which 27.95 s came from calling strlen and 19.43 s from memcmp.

In contrast, with the red-black tree, finish_attribute_name takes 110 ms: a speedup of about 567x (1.04 min ≈ 62.4 s; 62.4 s / 0.11 s ≈ 567).

Of course, the overall speedup is only about 2x (86.6 s vs. 49.5 s at 75,000 attributes in the profiled runs), which seems reasonable if I've removed one of two O(n^2) loops. Unfortunately, I cannot do anything about libxml2's implementation. So I guess we should implement a limit as you have done.

rafbm commented 4 years ago

Thanks a lot for this awesome investigation. If you guys want to go forward and merge this, let me know what your common practice is regarding README updates (documenting master or waiting for a new release). I could update it as well as the changelog in this PR.

stevecheckoway commented 4 years ago

I took a look and other than a few errant spaces and an update to the README and the CHANGELOG, this looks good. I'm not sure we have a policy for when the README gets updated, so we may as well do it now.

One thing that might be unclear is whether the limit should be on the number of unique attributes or on the number of attributes parsed. Duplicate attributes should be fast to parse when there are only a small number of unique ones. I think the current implementation is almost a limit on the number of unique attributes. The one edge case: if the limit is n, there are already n unique attributes, and the next one to be parsed is a duplicate, then this will raise an error.

I'm not sure I really care about that edge case, but it could probably be made consistent by deferring the check on the attribute count until after the code checks whether it's a duplicate.
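
To make the two orderings concrete, here's a rough Ruby sketch (the real logic lives in the C tokenizer; these helper names are invented purely for illustration):

# Counting first: n unique attributes followed by a duplicate raises.
def add_attribute_count_first(attrs, name, value, max_attributes)
  raise ArgumentError, 'max attributes exceeded' if attrs.size >= max_attributes
  return attrs if attrs.key?(name) # duplicate: the first value wins
  attrs.merge(name => value)
end

# Duplicate check first: duplicates can never trip the limit.
def add_attribute_duplicate_first(attrs, name, value, max_attributes)
  return attrs if attrs.key?(name)
  raise ArgumentError, 'max attributes exceeded' if attrs.size >= max_attributes
  attrs.merge(name => value)
end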

rafbm commented 4 years ago

Sorry for the nasty spaces. I updated the README and CHANGELOG as well.

Regarding the duplicate attribute edge case, here’s my take. Although Nokogumbo may be used in scenarios where the HTML payload is always trusted (HTML validation, etc.), I feel that when we’re talking about 400+ attributes on a given element (with or without duplicates), we are much more likely to be facing an attack than an error. And if we face an attack, we likely have not just a few hundred attributes but tens of thousands of them (like what we get when parsing spam emails received by our users, the very case that led me to investigate and fix this). In such cases I want to save as much CPU as possible, which I believe is better achieved by avoiding the loop over the attributes already stored.

I’m saying this because I keep in mind the eventual option we could introduce to ignore attributes and continue parsing instead of raising. When raising, my argument isn’t relevant because you’ll only go through finish_attribute_name() 400 times at most, but with the ignore option you would go through the function tens of thousands of times, hence the desire to micro-optimize it.

You’re the boss though; I can make it like you suggest if that makes more sense to you. :)

stevecheckoway commented 4 years ago

What you say makes sense. There's no legitimate use case for 400 attributes. I was thinking more about when users pick a smaller limit. But like I said, I don't really care about that edge case so I'm happy to go with what you have.

I'm not sure what's going on with the CI there, but it's not related to this PR.

Thanks so much for working on this!

stevecheckoway commented 4 years ago

And let me say I especially appreciate the script demonstrating the problem. It was really helpful for tracking down what was happening in libxml2.

rafbm commented 4 years ago

Thanks so much for merging! A pleasure working with you. :)

craigbarnes commented 4 years ago

I can't recall what the maximum tree depth was designed to prevent from happening. Maybe there's some other superlinear behavior I'm not remembering.

FWIW, this was done in one of the commits you pulled from lua-gumbo, in order to "fix" https://github.com/google/gumbo-parser/issues/393.