microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.08k stars 178 forks source link

Matching() does not do what's expected. #3

Closed dmitshur closed 10 years ago

dmitshur commented 10 years ago

Hi @buro9,

I think there's a bug in bluemonday. According to your comment,

So class="foo bar bash" is fine and would be allowed, but class="javascript:alert('XSS')" would fail and be stripped, and class="><script src='http://hackers.org/XSS.js'></script>" would also fail and be stripped.

I've added tests for that and noticed they failed.

I tried using bluemonday directly, and saw behavior counter to what you suggested. All three class="foo bar bash", class="javascript:alert(123)", and class="><script src='http://hackers.org/XSS.js'></script>" were passed through, without failing and being stripped.

$ goe --quiet 'p:=bluemonday.UGCPolicy(); p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("span"); println(p.Sanitize(`Hello <span class="foo bar bash">there</span> world.`))'
Hello <span class="foo bar bash">there</span> world.

$ goe --quiet 'p:=bluemonday.UGCPolicy(); p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("span"); println(p.Sanitize(`Hello <span class="javascript:alert(123)">there</span> world.`))'
Hello <span class="javascript:alert(123)">there</span> world.

$ goe --quiet 'p:=bluemonday.UGCPolicy(); p.AllowAttrs("class").Matching(bluemonday.SpaceSeparatedTokens).OnElements("span"); println(p.Sanitize(`Hello <span class="><script src='"'http://hackers.org/XSS.js'"'></script>">there</span> world.`))'
Hello <span class="&gt;&lt;script src=&#39;http://hackers.org/XSS.js&#39;&gt;&lt;/script&gt;">there</span> world.

Looking at the code, the reason is clear.

SpaceSeparatedTokens = regexp.MustCompile(`[\s\p{L}\p{N}_-]+`)

if ap.regexp.MatchString(htmlAttr.Val) {
    cleanAttrs = append(cleanAttrs, htmlAttr)
}

The regex doesn't have ^ at the front nor $ at the end, so it matches any substring. That's why <span class="javascript:alert(123)">there</span> is not stripped, but <span class=":::::">there</span> is stripped as expected.

How to fix that... I leave to you (there's more than one way, and I'm not a fan of regexes).

buro9 commented 10 years ago

This is resolved by https://github.com/microcosm-cc/bluemonday/commit/e5dc5729273710c9747856d55005c7c250c4e767

I have added a test for the examples provided as well as providing a test suite for all regular expressions that we ship.

dmitshur commented 10 years ago

Thanks for fixing it so promptly!