microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License
3.14k stars 176 forks source link

apostrophes get turned into HTML entities - take 2 #74

Closed spekary closed 5 years ago

spekary commented 6 years ago

I realize this was asked before, but I would like to renew the discussion for my use case.

I am getting data that is being entered into an html form by a user, and that data will be saved in a database eventually and redisplayed later. Obviously such data needs to be sanitized, so I turned to bluemonday. If a user enters an apostrophe in the data, the apostrophe is coming through to my application in the form data as an apostrophe, so GO is not converting it. When I run it through bluemonday, it gets converted to an html entity.

bluemonday states that it is designed for sanitizing html that will be displayed as html, so I understand why this is correct behavior from bluemonday's perspective.

Soo..., I am asking for a new sanitizer policy that allows bluemonday to be used as a basic text sanitizer for my scenario, where I am not trying to have the end result be html, but still I want the goal that XSS attacks are cleaned out. I imagine it could be a simple process of running the result through UnescapeString, which I will do for now, but you guys are the experts and might have other thoughts as to why this may or may not be adequate.

dmitshur commented 5 years ago

that data will be saved in a database eventually and redisplayed later.

How are you planning to display it later?

It sounds like what you’re looking for is an HTML to text renderer. Then you could use that on output of bluemonday.

buro9 commented 5 years ago

The html package contains html.UnescapeString(): https://golang.org/pkg/html/#UnescapeString

Whilst bluemonday is a HTML sanitizer and expects to take input that is HTML to then be displayed as HTML (and as such escaping HTML entities is correct and part of the underlying core package)... it is possible for you to simply take the output of bluemonday.Sanitize() and run that through html.UnescapeString to convert HTML entities back to plain text.

However, a warning... if you do this and then display the output as HTML, then you've provided the means for someone who can deduce that this behaviour exists to then craft a payload that relies on being unescaped such that the output does contain malicious code that wasn't present in the original input. i.e. providing HTML entities makes &lt;script&gt; safe from the perspective of the sanitizer as it would never render a script tag, but once you run the output through html.UnescapeString() you would have created <script>, which if now displayed as HTML would result in an XSS or worse.

The tools already exist to do what has been asked, but I still would not recommend using them unless you are in full control of how the resulting string will be used. This is why bluemonday carries the warning that it must be the last thing to process the content :)

nkev commented 5 years ago

I still don't get this. Say a user named Jim O'Hare fills in a form on my site and submits it. If I escape the input (using bluemonday or Go functions), the apostrophe becomes &#39; and that's what the user sees next time, which is not right. If I don't escape user inputs then I'm open to XSS attacks. What am I missing?

buro9 commented 5 years ago

You are double-escaping.

Escaping HTML entities is an integral part of sanitising input, but if you've escaped there then escaping again in the presentation layer of your website results in the escaped entities being escaped again.

Either you unescape post-sanitization, or you don't escape a second time. Either option exists for you with the latter being the safest.

nkev commented 5 years ago

Thanks, but I'm escaping only once, and only during presentation:

buro9 commented 5 years ago

Ah, OK.

What I do:

The goal is to never render the original input as HTML within a page, but it can still be rendered as plain text... i.e. within a text input box.

spekary commented 5 years ago

Based on your previous response, would this be reasonable? • Run the original input through bluemonday • Run the result of this through html.UnescapeString() • Save this result in the database • Later, when I want to display the saved value, run the value through html.EscapeString before sending it to the output buffer.

buro9 commented 5 years ago

That would work too.

I prefer to store original and allow users to edit that as it's less surprising. But this is UX rather than security and what you've described also works fine.

leodip commented 9 months ago

I'm doing this (call me crazy)

func (i *InputSanitizer) Sanitize(str string) string {

    p := bluemonday.StrictPolicy()
    p.AllowStandardURLs()

    // sanitizing twice to allow apostrophes, and at the same time,
    // to avoid entries like &lt;script&gt; from becoming <script>
    // some discussions:
    // https://github.com/microcosm-cc/bluemonday/issues/28
    // https://github.com/microcosm-cc/bluemonday/issues/74

    sanitized := p.Sanitize(str)
    unescaped := html.UnescapeString(sanitized)
    sanitized = p.Sanitize(unescaped)
    return html.UnescapeString(sanitized)
}
ACLzz commented 3 months ago

@leodip looks good at first, but if you encode xss two times at the end your string will be with exploit.