Update to Nokogumbo 2.0

This can't be merged as is, but I wanted to share what I'd done, in case you want to update Nokogumbo to 2.0.

Nokogumbo 2.0 has had a bunch of fixes (many around errors which I don't think sanitize cares about), but it also does spec-compliant parsing of HTML fragments and serialization.

This changes a number of things as mentioned here.

There are still 7 test failures with this PR, some of which I can explain, others I'm not sure why they're happening and I haven't investigated closely.

  1) Failure:
Malicious HTML::unsafe libxml2 server-side includes in attributes#test_0015_should not escape characters unnecessarily [/Users/steve/programming/sanitize/test/test_malicious_html.rb:182]:
--- expected
+++ actual
@@ -1 +1 @@
-"<div name='examp<!--\" onmouseover=alert(1)>-->le.com'>foo</div>"
+"<div name=\"examp<!--&quot; onmouseover=alert(1)>-->le.com\">foo</div>"

  2) Failure:
Transformers::YouTube transformer#test_0002_should allow HTTPS YouTube video embeds [/Users/steve/programming/sanitize/test/test_transformers.rb:181]:
--- expected
+++ actual
@@ -1 +1 @@
-"<iframe width=\"420\" height=\"315\" src=\"https://www.youtube.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\">&lt;script&gt;alert()&lt;/script&gt;</iframe>"
+"<iframe width=\"420\" height=\"315\" src=\"https://www.youtube.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\"><script>alert()</script></iframe>"

  3) Failure:
Transformers::YouTube transformer#test_0003_should allow protocol-relative YouTube video embeds [/Users/steve/programming/sanitize/test/test_transformers.rb:188]:
--- expected
+++ actual
@@ -1 +1 @@
-"<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\">&lt;script&gt;alert()&lt;/script&gt;</iframe>"
+"<iframe width=\"420\" height=\"315\" src=\"//www.youtube.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\"><script>alert()</script></iframe>"

  4) Failure:
Transformers::YouTube transformer#test_0004_should allow privacy-enhanced YouTube video embeds [/Users/steve/programming/sanitize/test/test_transformers.rb:195]:
--- expected
+++ actual
@@ -1 +1 @@
-"<iframe width=\"420\" height=\"315\" src=\"https://www.youtube-nocookie.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\">&lt;script&gt;alert()&lt;/script&gt;</iframe>"
+"<iframe width=\"420\" height=\"315\" src=\"https://www.youtube-nocookie.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\"><script>alert()</script></iframe>"

  5) Failure:
Transformers::YouTube transformer#test_0001_should allow HTTP YouTube video embeds [/Users/steve/programming/sanitize/test/test_transformers.rb:174]:
--- expected
+++ actual
@@ -1 +1 @@
-"<iframe width=\"420\" height=\"315\" src=\"http://www.youtube.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\">&lt;script&gt;alert()&lt;/script&gt;</iframe>"
+"<iframe width=\"420\" height=\"315\" src=\"http://www.youtube.com/embed/QH2-TGUlwu4\" frameborder=\"0\" allowfullscreen=\"\"><script>alert()</script></iframe>"

  6) Failure:
Parser#test_0006_should work around the libxml2 content-type meta tag bug [/Users/steve/programming/sanitize/test/test_parser.rb:54]:
--- expected
+++ actual
@@ -1 +1 @@
-"<html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"></head><body>Howdy!</body></html>"
+"<html><head><meta http-equiv=\"content-type\" content=\"text/html;charset=us-ascii\"></head><body>Howdy!</body></html>"

  7) Failure:
Sanitize::Transformers::CleanDoctype::when :allow_doctype is true#test_0001_should allow doctype declarations in documents [/Users/steve/programming/sanitize/test/test_clean_doctype.rb:39]:
--- expected
+++ actual
@@ -1 +1 @@
-"<!DOCTYPE html PUBLIC \"-//W3C//DTD HTML 4.01//EN\"><html>foo</html>"
+"<!DOCTYPE html><html>foo</html>"

Failures 2–5 seem like pretty concerning failures. I'm not sure why that happens.

The HTML spec specifies that " in attribute values be escaped as " and attribute values themselves are always surrounded by ". That explains Failure 1.

Libxml2 has the horrible behavior of inserting meta elements which you had code to remove. Nokogumbo 2.0 no longer uses libxml2's serialization (which is broken for HTML5 anyway, e.g., https://github.com/rubys/nokogumbo/issues/14). Nokogumbo no longer does anything with meta elements, hence Failure 6.

Finally, the HTML spec strips out public and system identifiers in DOCTYPEs when serializing. That explains Failure 7.

I have to run right this minute, but I'm happy to help make this work for you.

Thanks! This is super helpful. I should have some time this weekend to dive in and see what's up with those remaining test failures.

For the content type failure, something like

attr = doc.at_xpath('/html/head/meta[@http-equiv="content-type"]/@content')
attr.content = 'text/html; charset=utf-8' unless attr.nil?

should fix it.

I'm open to adding an option to the serialize methods to output the DOCTYPE public and system identifiers, if present. HTML treats a present but empty identifier as different from a missing identifier but Gumbo does not. As a result, I believe Nokogumbo treats an empty identifier as missing. I think this would be easy to fix.

I'm also open to adding a slightly more smart attribute serialization. We probably don't want to include something as aggressive as this which will turn <foo bar="baz"> into <foo bar=baz>, but checking if the attribute value contains " and not ' and using single quotes in that case seems simple and reasonable.

I'm open to adding an option to the serialize methods to output the DOCTYPE public and system identifiers, if present.

I don't have any real opinion on this, so whatever you think is best is fine with me.

I'm also open to adding a slightly more smart attribute serialization.

I personally prefer always using double quoted attributes. I think the principle of least surprise is valuable here; if Sanitize's output sometimes uses double quotes and sometimes uses single quotes depending on attribute values, that could surprise people.

I've been looking into the YouTube transformer failures. When parsing a non-empty <iframe> element, Nokogumbo treats its contents as text:

#(Element:0x3fc8e1c86d44 {
  name = "iframe",
  attributes = [
    #(Attr:0x3fc8e1c86380 { name = "width", value = "420" }),
    #(Attr:0x3fc8e1c8636c { name = "height", value = "315" }),
    #(Attr:0x3fc8e1c86358 {
      name = "src",
      value = "http://www.youtube.com/embed/QH2-TGUlwu4"
      }),
    #(Attr:0x3fc8e1c86344 { name = "frameborder", value = "0" }),
    #(Attr:0x3fc8e1c86330 { name = "allowfullscreen", value = "" }),
    #(Attr:0x3fc8e1c8631c { name = "bogus", value = "bogus" })],
  children = [ #(Text "<script>alert()</script>")]
  })

This seems correct from a parsing perspective. But when the iframe is serialized, this text content is serialized verbatim, and the <script> is not escaped.

When these tests were originally written, my understanding was that <iframe> content could be rendered as fallback markup in legacy browsers that don't support iframes, so script injection could be possible here.

The HTML Living Standard says <iframe> must not contain anything, but adds:

Descendants of iframe elements represent nothing. (In legacy user agents that do not support iframe elements, the contents would be parsed as markup that could act as fallback content.)

Note: The HTML parser treats markup inside iframe elements as text.

It also says that when serializing a Text node whose parent is an iframe, its content should be appended literally without escaping:

If the parent of current node is a style, script, xmp, iframe, noembed, noframes, or plaintext element, or if the parent of current node is a noscript element and scripting is enabled for the node, then append the value of current node's data IDL attribute literally.

So it seems like Nokogumbo is behaving correctly here, but it also seems like the result could be unsafe in old browsers, though I'm not sure exactly which browsers are old enough to both not support iframes and to execute scripts within iframe content. I guess a safe route might be for Sanitize to manually escape iframe contents? 🤔

I'm open to adding an option to the serialize methods to output the DOCTYPE public and system identifiers, if present.

I don't have any real opinion on this, so whatever you think is best is fine with me.

I have no need of such functionality as I don't need to support transitional documents. So if you're okay with having the identifiers removed, then the test case could be updated. That might violate the principle of least surprise for your users though. So up to you.

I'm also open to adding a slightly more smart attribute serialization.

I personally prefer always using double quoted attributes. I think the principle of least surprise is valuable here; if Sanitize's output sometimes uses double quotes and sometimes uses single quotes depending on attribute values, that could surprise people.

I agree. So then I guess the test case could change to using double quotes with "

So it seems like Nokogumbo is behaving correctly here, but it also seems like the result could be unsafe in old browsers, though I'm not sure exactly which browsers are old enough to both not support iframes and to execute scripts within iframe content. I guess a safe route might be for Sanitize to manually escape iframe contents? 🤔

Escaping the text should be relatively straight forward. Something like

doc = Nokogiri::HTML5.fragment('<iframe><script>alert("hi")</script></iframe>')
doc.xpath('.//iframe/text()').each do |t|
  t.content = t.content.gsub(/[&\u00a0<>]/, '&' => '&amp;', "\u00a0" => '&nbsp;', '<' => '&lt;', '>' => '&gt;')
end

(I don't know why the period in that XPath expression is necessary, but it doesn't work without it.)

Alternatively, you could just drop all children of an iframe.

Cool, sounds like we’re on the same page. I’ve got most of these changes finished, but had to stop for tonight. Should be able to finish them up tomorrow and merge this.

rgrove / sanitize

Update to Nokogumbo 2.0 #189