Closed rafbm closed 4 years ago
Hi,
Thanks for this PR.
Like the max tree depth option, I think this is a fine thing to add. Looking at the code, checking for duplicate attribute names is clearly O(n^2) in the number of names. (This could be amortized constant time using a hash set. It may be worth measuring how much of a slow down that is for the normal case at some point in the future. I'll make a note about it.)
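To illustrate the point about duplicate detection (a hypothetical Ruby sketch for clarity; the actual check lives in Nokogumbo's C parsing code):

```ruby
require "set"
require "benchmark"

# O(n^2): scan every previously seen name for each new attribute,
# analogous to the linear scan in the parser's duplicate check.
def duplicate_scan(names)
  seen = []
  names.each { |n| seen << n unless seen.include?(n) }
  seen
end

# Amortized O(n): a hash set makes each membership check ~constant time.
def duplicate_set(names)
  seen = Set.new
  names.each { |n| seen.add(n) }
  seen.to_a
end

names = (1..5_000).map { |i| "data-#{i}" }
puts Benchmark.measure { duplicate_scan(names) }
puts Benchmark.measure { duplicate_set(names) }
```

Both return the same unique list; only the membership-check cost differs.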
I think hitting this limit should be reported to the user in some fashion as silently dropping attributes seems bad. For hitting the tree depth limit, we raise an argument error. Should we do the same thing here or something else?
Thanks for the follow-up. Now that you mention it, it's true that silently ignoring attributes vs. raising for exceeded tree depth isn't consistent.
In our application code base, we actually rescue the `ArgumentError` raised by Nokogumbo when hitting the max tree depth, and in the rescue block we fall back to parsing with plain Nokogiri, which has a non-customizable ~256 max tree depth that behaves more like my `:max_attributes` here: it silently discards any deeper content. Ultimately, as an email client we want to keep and display as much decent HTML as we can for our users, so for edge cases like this we prefer a silent limiting behavior over bailing and having to display "This email isn't rendered because it is too extreme in some way." I understand other applications might have different expectations.
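The rescue-and-fall-back pattern described above can be sketched with stand-in parsers (the names, the length-based limit, and the truncation are all hypothetical; the real code rescues Nokogumbo's `ArgumentError` and falls back to plain Nokogiri):

```ruby
# Stand-in for a strict parser that raises when a limit is exceeded.
# (Hypothetical limit: the real one is max tree depth, not length.)
def strict_parse(html, max_len: 50)
  raise ArgumentError, "input exceeds limit" if html.length > max_len
  "strict: #{html}"
end

# Stand-in for a lenient parser that silently truncates instead.
def lenient_parse(html)
  "lenient: #{html[0, 50]}"
end

def parse_email(html)
  strict_parse(html)
rescue ArgumentError
  # Fall back so the user still sees *something* rendered.
  lenient_parse(html)
end
```

Small inputs go through the strict path; oversized ones silently fall back.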
To be consistent, I can tweak this new code to raise `ArgumentError` instead of ignoring further attributes. In our app we could rescue it and parse with plain Nokogiri, like for the tree depth case. Some custom subclasses of `ArgumentError` would be nice, like `MaxTreeDepthExceeded` and `MaxAttributesExceeded`.
That said, I’m curious to see whether the O(n^2) issue you noticed will make my PR here irrelevant. Let me know how you see it.
I just tested the script with plain Nokogiri and it suffers the same issue, even worse:
```
                      user     system      total        real
100 attributes    0.000267   0.000033   0.000300 (  0.000298)
1000 attributes   0.011261   0.000078   0.011339 (  0.011340)
5000 attributes   0.297772   0.000709   0.298481 (  0.298574)
10000 attributes  1.195172   0.002421   1.197593 (  1.198453)
25000 attributes  7.266606   0.013161   7.279767 (  7.284907)
50000 attributes 29.401483   0.043176  29.444659 ( 29.468001)
75000 attributes 65.669765   0.094587  65.764352 ( 65.853030)
```
So falling back to Nokogiri is not an option here. In our app's case, it would be ideal to silently ignore attributes past X. I wonder what a proper design would be. Some `limits_exceeded: :silence` / `:raise` argument that would alter the behavior of both `:max_tree_depth` and `:max_attributes`?
Perhaps if we just heavily increase `DEFAULT_MAX_ATTRIBUTES` to something like `10000`, then silencing is OK for everyone and we don't actually need the `:max_attributes` argument?

Here's what I get with `max_attributes: 10_000`, quite reasonable:
```
                      user     system      total        real
100 attributes    0.000252   0.000019   0.000271 (  0.000270)
1000 attributes   0.005898   0.000074   0.005972 (  0.005972)
5000 attributes   0.183733   0.001021   0.184754 (  0.184980)
10000 attributes  0.758465   0.002561   0.761026 (  0.761851)
25000 attributes  0.778167   0.002476   0.780643 (  0.781487)
50000 attributes  0.805981   0.002015   0.807996 (  0.808754)
75000 attributes  0.801902   0.001647   0.803549 (  0.804507)
```
Having to fall back to Nokogiri seems like a Nokogumbo design flaw.
I can't recall what the maximum tree depth was designed to prevent from happening. Maybe there's some other superlinear behavior I'm not remembering.
Without committing to any particular API, I could envision all of these behaviors being reasonable and desirable in some situations.
I think any of them other than 1 needs to be opt-in. And depending on the signaling mechanism, 2 and 3 could be the same thing, e.g., after opting in, the user would have to check some status.
Would any of those options work for you?
Just as an aside since computer security is my focus, I can't help but think of an attack where a maximum attribute count could be abused to show different content to your users vs. other software without such a limit. If the attribute limit is n, then
```html
<div data-1 data-2 ... data-n display=none>
  Conditional content
</div>
```
The conditional content would be displayed with an attribute limit but would not be displayed without it.
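The effect can be simulated in a few lines of Ruby (a toy attribute-truncating model, not Nokogumbo itself):

```ruby
# Toy model: a limiting parser keeps only the first `limit` attributes
# of an element; an unlimited parser keeps them all.
def kept_attributes(attrs, limit)
  attrs.first(limit)
end

limit = 5
attrs = (1..limit).map { |i| "data-#{i}" } + ["display=none"]

with_limit    = kept_attributes(attrs, limit)
without_limit = kept_attributes(attrs, attrs.size)

# Software with the limit drops display=none, so the content renders;
# software without the limit keeps it, so the content stays hidden.
p with_limit.include?("display=none")     # => false
p without_limit.include?("display=none")  # => true
```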
> Would any of those options work for you?
Of course. Everything you said makes sense. Option 4 may not be crucial though. And after some thinking, option 1 alone would be good enough in our app’s context.
I will change the current PR to raise `ArgumentError` instead of discarding, like the tree depth case. I believe that would be a good first step to merge on its own; we could shelve the other options for another PR.
TL;DR The patch seems reasonable to me (although I haven't tested). @rubys, thoughts?
I was so sure that the O(n^2) behavior around checking for duplicate attributes was the problem that I decided to fix the issue. I realized that if you're concerned about malicious input, a hash set is unlikely to be helpful, because it's quite tricky to guarantee a hash set's worst-case behavior against adversarial keys. So I implemented a simple red-black tree to use as an attribute cache while parsing, and ran your script expecting significantly better behavior.
```
                      user     system      total        real
100 attributes    0.000533   0.000078   0.000611 (  0.000595)
1000 attributes   0.003087   0.000227   0.003314 (  0.003351)
5000 attributes   0.082747   0.002031   0.084778 (  0.085699)
10000 attributes  0.348805   0.006017   0.354822 (  0.358073)
25000 attributes  2.244946   0.047076   2.292022 (  2.329323)
50000 attributes 13.702191   0.253913  13.956104 ( 14.201136)
75000 attributes 47.923286   0.795756  48.719042 ( 49.464806)
```
So obviously that's not the bottleneck. I ran with `ruby-prof` (actually, the above numbers come from the `ruby-prof` run) and, unsurprisingly, `Nokogumbo#parse` is where all the time is taken.
```
%self  total   self    wait   child  calls  name                       location
99.66  66.442  66.442  0.000  0.000  7      <Module::Nokogumbo>#parse
```
But that's not helpful because `ruby-prof` can't profile the C code. So I ran it in Instruments (it's annoyingly hard to configure that correctly with `bundler`) and identified the culprit.
```
Time      Weight   Self Weight  Symbol Name
1.08 min  100.0%   1.08 min     xmlNewPropInternal
```
Slightly different format from `ruby-prof`, but `xmlNewPropInternal` is taking 1.08 minutes, while `Nokogumbo#parse` was taking 66.442 seconds (1.11 minutes). And all but less than 200 ms is spent in `xmlNewPropInternal` itself, not in any functions it calls.
So let's take a look at the code.
```c
/*
 * Add it at the end to preserve parsing order ...
 */
if (node != NULL) {
    if (node->properties == NULL) {
        node->properties = cur;
    } else {
        xmlAttrPtr prev = node->properties;

        while (prev->next != NULL)
            prev = prev->next;
        prev->next = cur;
        cur->prev = prev;
    }
}
```
And there it is. That `while` loop performs insertion at the end of a doubly linked list by traversing the entire list. In fact, that's pretty clear to see in the annotated assembly in the time profile.
```
+0x1f5  movq   %rax, %rcx
+0x1f8  movq   48(%rcx), %rax
+0x1fc  testq  %rax, %rax
+0x1ff  jne    "xmlNewPropInternal+0x1f5"
```
The `test` instruction (performing the comparison against `NULL`) was the hottest instruction.
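The pattern is easy to reproduce outside C: appending to a linked list by walking to the tail on every insert is O(n) per append and O(n^2) overall, while keeping a tail pointer makes each append O(1). A minimal Ruby sketch (illustrative only; `Node` and the function names are made up):

```ruby
Node = Struct.new(:value, :next)

# O(n) per append: walk the whole list every time, as xmlNewPropInternal does.
def append_by_walking(head, node)
  return node if head.nil?
  cur = head
  cur = cur.next while cur.next
  cur.next = node
  head
end

# O(1) per append: remember the tail and link directly.
def append_with_tail(tail, node)
  tail.next = node unless tail.nil?
  node
end

# Build a 3-element list the slow way and collect its values in order.
head = nil
3.times { |i| head = append_by_walking(head, Node.new(i, nil)) }
values = []
cur = head
while cur
  values << cur.value
  cur = cur.next
end
p values  # => [0, 1, 2]
```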
Finally, just to compare with the current implementation (without your patch), here are the top functions.
Timing script:
```
                      user     system      total        real
100 attributes    0.000447   0.000092   0.000539 (  0.000531)
1000 attributes   0.008460   0.000275   0.008735 (  0.008842)
5000 attributes   0.231836   0.004479   0.236315 (  0.238988)
10000 attributes  0.937308   0.015195   0.952503 (  0.963512)
25000 attributes  5.386155   0.083429   5.469584 (  5.532977)
50000 attributes 28.609162   0.400282  29.009444 ( 29.281556)
75000 attributes 84.533603   1.211917  85.745520 ( 86.640742)
```
`ruby-prof`:
```
%self  total    self     wait   child  calls  name                       location
99.83  122.665  122.665  0.000  0.000  7      <Module::Nokogumbo>#parse
```
Top functions in the time profile:
```
Time      Weight  Self Weight  Symbol Name
57.01 s   46.8%   56.82 s      xmlNewPropInternal
1.04 min  51.4%   15.09 s      finish_attribute_name
27.95 s   22.9%   27.95 s      strlen
19.43 s   15.9%   19.43 s      _platform_memcmp
```
`xmlNewPropInternal` took 57.01 s. `finish_attribute_name` took 1.04 min, of which 27.95 s were spent in `strlen` and 19.43 s in `memcmp`.
In contrast, with the red-black tree, `finish_attribute_name` takes 110 ms: a 567x speedup.
Of course, the overall speedup is only about 2x, which seems reasonable if I've removed one of two O(n^2) loops. Unfortunately, I cannot do anything about `libxml2`'s implementation. So I guess we should implement a limit as you have done.
Thanks a lot for this awesome investigation. If you want to go forward and merge this, let me know what your common practice is regarding README updates (documenting master or waiting for a new release). I could update it as well as the changelog in this PR.
I took a look and other than a few errant spaces and an update to the README and the CHANGELOG, this looks good. I'm not sure we have a policy for when the README gets updated, so we may as well do it now.
One thing that might be unclear: should the limit apply to the number of unique attributes or to the total number of attributes parsed? Duplicate attributes should be fast to parse when there are only a small number of unique ones. I think the current implementation is almost a limit on the number of unique attributes. The one edge case is that if the limit is n, there are n unique attributes, and the next one to be parsed is a duplicate, then this will raise an error.
I'm not sure I really care about that edge case, but it could probably be made consistent by deferring the check on the count of attributes until after the code checks if it's a duplicate.
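The suggested ordering (duplicate check first, count check second) can be sketched in Ruby (a hypothetical simplification of the `finish_attribute_name` logic, not the actual C):

```ruby
# Raise only when a genuinely new attribute would exceed the limit;
# duplicates are discarded before the count is checked, so a duplicate
# arriving right at the limit does not raise.
def add_attribute(attrs, name, max_attributes)
  return attrs if attrs.include?(name)
  raise ArgumentError, "max attributes exceeded" if attrs.size >= max_attributes
  attrs << name
end

attrs = []
add_attribute(attrs, "id", 2)
add_attribute(attrs, "class", 2)
add_attribute(attrs, "id", 2)   # duplicate at the limit: silently ignored
p attrs  # => ["id", "class"]
```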
Sorry for the nasty spaces. I updated the README and CHANGELOG as well.
Regarding the duplicate attribute edge case, here's my take. Although Nokogumbo may be used in scenarios where the HTML payload is always trusted (HTML validation, etc.), when we're talking about 400+ attributes on a single element (with or without duplicates), we are much more likely facing an attack than an error. And if we face an attack, we likely have not just a few hundred attributes but tens of thousands of them (like what we get when parsing spam emails received by our users, the very case that led me to investigate and fix this). In such cases I want to save as much CPU as possible, which I believe is better achieved by avoiding the loop over the attributes already stored.
I'm saying this because I keep in mind the eventual option we could introduce to ignore attributes and continue parsing instead of raising. When raising, my argument isn't relevant because you'll only go through `finish_attribute_name()` 400 times at most, but with the ignore option you would go through the function tens of thousands of times, hence the desire to micro-optimize it.
You’re the boss though; I can make it like you suggest if that makes more sense to you. :)
What you say makes sense. There's no legitimate use case for 400 attributes. I was thinking more about when users pick a smaller limit. But like I said, I don't really care about that edge case so I'm happy to go with what you have.
I'm not sure what's going on with the CI there, but it's not related to this PR.
Thanks so much for working on this!
And let me say I especially appreciate the script demonstrating the problem. It was really helpful for tracking down what was happening in libxml2.
Thanks so much for merging! A pleasure working with you. :)
> I can't recall what the maximum tree depth was designed to prevent from happening. Maybe there's some other superlinear behavior I'm not remembering.
FWIW, this was done in one of the commits you pulled from lua-gumbo, in order to "fix" https://github.com/google/gumbo-parser/issues/393.
> Prior to this commit, the time taken to parse elements with an unreasonable number of attributes is non-linear and can lead to excessive memory usage, process hanging and crash.
As a consumer of unsafe HTML, you must enforce some limits on payloads prior to parsing. One easy example is truncating the input string to some maximum length. However, in the case of an extreme number of attributes, there's no practical way to protect yourself while preserving a decent parsed HTML output (the HTML itself isn't that long). Thus, I believe Nokogumbo should internally enforce a limit after which the attributes of a given element are ignored.
Here’s a script to illustrate the current non-linear work time:
With the proposed fix, that script becomes:
To determine what the value of `Nokogumbo::DEFAULT_MAX_ATTRIBUTES` should be, I looked at lists of attributes to get an idea of how many attributes you could technically use on a given element:

https://github.com/wooorm/html-element-attributes/blob/master/index.json
https://github.com/wooorm/svg-element-attributes/blob/master/index.json
https://github.com/wooorm/aria-attributes/blob/master/index.json

(script used)

I believe all of these can be combined on the same SVG element. This obviously doesn't consider `data-*` attributes, but those are unlimited in theory, so the value we pick will be arbitrary no matter what. I ended up using `400`, just because we have `Nokogumbo::DEFAULT_MAX_TREE_DEPTH = 400` already; the same number seemed to make some sense.

I assumed we'd want an argument to customize that limit. I went with `:max_attributes`, although it could be clearer as `:max_attributes_per_element`. But I don't think the added length is desirable or necessary.
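A hedged sketch of that kind of attribute counting, using a tiny inline sample shaped like those JSON lists (the real data lives at the URLs above; this is not the actual script, which is only linked, not shown, in this thread):

```ruby
require "set"

# Sample shaped like html-element-attributes' index.json:
# "*" holds global attributes; other keys list per-element attributes.
html_attrs = {
  "*"   => %w[class id style title],
  "svg" => %w[width height viewBox],
  "a"   => %w[href target],
}

# Attributes usable on an <svg> element: globals plus svg-specific ones.
usable_on_svg = Set.new(html_attrs["*"]) | html_attrs.fetch("svg", [])
puts usable_on_svg.size  # => 7
```

Running the same union over the full HTML, SVG, and ARIA lists is what suggests a few hundred legitimately combinable attributes, hence `400` being a comfortable ceiling.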