microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License

Sanitization replaces character entities with blank spaces #96

Open heltonrlustosa opened 4 years ago

heltonrlustosa commented 4 years ago

Hey. We are using the bluemonday library in a new project, and in some cases I need to save the string with character entities (&nbsp;, &lt;, &gt;, ...). But after sanitizing some examples, we realised that the output no longer contains the non-breaking space entity, for example.

Code example:

package main

import (
    "fmt"
    "github.com/microcosm-cc/bluemonday"
)

func main() {
    p := bluemonday.UGCPolicy()

    p.AllowStyling()
    p.AllowAttrs("style").Globally()
    p.AllowStandardAttributes()

    result := p.Sanitize(
        `<p>I am normal</p>&nbsp<p style="color:red;">After space</p>&nbsp<p style="font-size:50px;">I am big</p>`,
    )

    fmt.Println(result)
    // Output:
    // <p>I am normal</p> <p style="color:red;">After space</p> <p style="font-size:50px;">I am big</p>
}

Did I forget to add a policy?

Thank you.

buro9 commented 4 years ago

I do not understand the scenario.

Are you saying that a blank space between paragraphs should be converted to a non-breaking space character?

And I do not see in the example anything that demonstrates an issue with &lt; and &gt;.

If all you seek to do is fully escape a string for presentation as HTML then does https://golang.org/pkg/html/#EscapeString not do this?
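A minimal sketch of that suggestion, using only the standard library's html.EscapeString. The helper name escapeForDisplay and the sample input are made up for illustration; note that this escapes all markup rather than sanitizing it, so it is a different operation from what bluemonday does.

```go
package main

import (
	"fmt"
	"html"
)

// escapeForDisplay is a hypothetical helper; it only wraps html.EscapeString,
// which escapes <, >, &, ' and " so the input is shown as literal text.
func escapeForDisplay(s string) string {
	return html.EscapeString(s)
}

func main() {
	fmt.Println(escapeForDisplay("<p>I am normal</p>"))
	// Prints: &lt;p&gt;I am normal&lt;/p&gt;
}
```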

heltonrlustosa commented 4 years ago

Sorry, I sent you an incorrect example. My problem is with an escaped string that contains "&nbsp"; in that case Sanitize is removing it.

I will edit the description and put in a correct example.

Thank you.

buro9 commented 4 years ago

I think... that it's fine, but that the console and text editors display it weirdly.

Nothing in my code explicitly touches &nbsp;, and I see the net/html package decodes &nbsp; to \u00a0 (the Unicode no-break space).

I've looked at the output of the example you've provided, and initially it looks like they are converted to whitespace. But look closer: put the output into a good text editor and look at the whitespace (or select all whitespace that matches the space in "I am") and you'll see that what replaced the &nbsp; isn't actually an ordinary space character. If you inspect it, you'll see it is a Unicode no-break space.

So bluemonday is doing precisely what the net/html package believes is the best way to do this.

Delicious-Bacon commented 1 year ago

@heltonrlustosa you should use Go's %q verb with fmt.Printf if you wish to see &nbsp; and other "hidden" characters (runes).

fmt.Printf("%q\n", result)
// "<p>I am normal</p>\u00a0<p style=\"color:red;\">After space</p>\u00a0<p style=\"font-size:50px;\">I am big</p>"

\u00a0 == &nbsp;, therefore, bluemonday works as intended, and your implementation does what you wanted it to do.
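If the stored string really must contain the literal entity text rather than the character, one possible approach is to re-encode U+00A0 after sanitizing. This is only a sketch: encodeNBSP is a hypothetical helper, not part of bluemonday, and it assumes the result will be emitted as HTML without being escaped again.

```go
package main

import (
	"fmt"
	"strings"
)

// encodeNBSP is a hypothetical post-processing step: it replaces every
// U+00A0 character with the literal text "&nbsp;". Ordinary spaces are
// left untouched.
func encodeNBSP(s string) string {
	return strings.ReplaceAll(s, "\u00a0", "&nbsp;")
}

func main() {
	// Stand-in for sanitized output that contains a no-break space.
	sanitized := "<p>I am normal</p>\u00a0<p>After space</p>"
	fmt.Println(encodeNBSP(sanitized))
	// Prints: <p>I am normal</p>&nbsp;<p>After space</p>
}
```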

Read more in the fmt package documentation.

You should close this issue.