microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.14k stars 176 forks source link

When a self-closing iframe is present, content afterward does not get sanitized. #81

Closed dy-dx closed 4 years ago

dy-dx commented 5 years ago
package main

import (
    "fmt"
    "html"

    "github.com/microcosm-cc/bluemonday"
)

func main() {
    input := `<iframe /><script type="text/javascript">doBadStuff();</script>`

    fmt.Println(html.UnescapeString(bluemonday.UGCPolicy().Sanitize(input)))
    // output:
    // <script type="text/javascript">doBadStuff();</script>
}
c14h3 commented 4 years ago

I got an other output after Sanitizing currently: &lt;script type=&#34;text/javascript&#34;&gt;dobadstuff();&lt;/script&gt; this is fine I think.

akasandra commented 4 years ago

To measure the reliability of the library, it is worth writing an explanation why this has happened. Since this issue has occurred, it is now clear that library is not safe and we have to perform complex fuzzing to find out more bugs in its function.

buro9 commented 4 years ago

There is no such thing as a self-closing IFRAME. Some HTML elements may omit end tags, but the IFRAME is not one of those. This means that the / is erroneous and for <iframe /><script type="text/javascript">doBadStuff();</script> the IFRAME is opened, but not closed... at this point the underlying Go HTML tokenization library will treat the contents of the IFRAME as plain text and escape it as such https://github.com/golang/net/blob/master/html/token.go#L1219

This is not a security issue, as the text is then escaped as raw text as is appropriate for the content of those tags, inline with the behaviour of the Go HTML library. As the output is escaped it does not pose a risk.

I have investigated this, but this behaviour is both safe, and aligns to other HTML processors. This means that this Github issue represents a case of GIGO and whilst the library continues to ensure there is no security risk, it would look unpretty when rendered as the escaped text would be rendered. This behaviour is correct.

buro9 commented 4 years ago

PS: Thank you though... I enjoyed investigating what other HTML engines do here and also found examples in bug reports to other libraries closed similarly or with jsfiddles to demonstrate the issue, and saw that the W3C validator reports a self-closing IFRAME as invalid.

buro9 commented 4 years ago

The key reference is this one: https://www.w3.org/TR/2016/PR-html51-20160915/semantics-embedded-content.html#the-iframe-element

Neither tag is omissible.

Q: So what is supposed to be between the opening and closing IFRAME tag? A: Content to be presented to user agents that do not support IFRAME.

What goes between the opening and closing IFRAME tag is equivalent to what you may put inside a NOSCRIPT tag.

The HTML 4 spec shows this clearly, though the HTML 5 spec seems to have not made it as explicit:

https://www.w3.org/TR/html401/present/frames.html#h-16.5

<IFRAME src="foo.html" width="400" height="500"
             scrolling="auto" frameborder="1">
  [Your user agent does not support frames or is currently configured
  not to display frames. However, you may visit
  <A href="foo.html">the related document.</A>]
  </IFRAME>
akasandra commented 4 years ago

When you load invalid HTML in some browsers, browser may behave differently. Some engines could try to ignore the mistake. Meaning the issue turns current strategy into a flaw.

buro9 commented 4 years ago

This library does not attempt to correct input according to target clients that may behave differently.

This library only provides security against the input being rendered in such a way as to provide a mechanism for XSS.

GIGO applies.

Whatever generates the input that you wish to sanitize should already produce the well-formed HTML that you wish to render.

If you deem this a flaw, then the flaw resides in whatever is producing the self-closing IFRAME as this is not valid HTML.

If you use this: http://validator.w3.org/

And validate against direct input:

<iframe /><script type="text/javascript">doBadStuff();</script>

You will see both parts of the issue with the input:

  1. A self-closing slash is not valid on IFRAME
  2. The SCRIPT tag is now treated as text
Screenshot 2020-06-17 at 10 08 05

This library continues to perform correctly, and escapes the text inside of the (unterminated) IFRAME, this providing the security feature.

The input needs fixing. If the input is created by a tool then that tool has a bug.