microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License

Suggestion: Insert white space when stripping tags #33

Status: Closed (crantok closed 8 years ago)

crantok commented 8 years ago

I'm using StrictPolicy() to strip tags from text in order to feed MongoDB full-text search. The text content of adjacent elements may be visually separated by HTML rendering even though there is no whitespace in the text. Stripping the tags therefore merges words, potentially altering search results. Here's an example:

package main

import (
    "fmt"
    "github.com/microcosm-cc/bluemonday"
)

func main() {
    userInput := "<p>Why oh why</p><p>she swallowed a fly</p>"
    searchableText := bluemonday.StrictPolicy().Sanitize(userInput)

    fmt.Println(searchableText) // Why oh whyshe swallowed a fly
}

I can easily solve this in my own code, e.g. by inserting a space before or after every block-level HTML element before stripping the tags.

I wondered whether this would be a generally useful feature. A general solution might need to be configurable, given that even adjacent inline elements can be visually separated through CSS.
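The workaround described above (padding block-level elements before stripping) could be sketched roughly as follows. This is only an illustration using the standard library: the helper name spaceBlockTags and the list of tags treated as block-level are assumptions, not part of bluemonday. The padded string would then be passed to bluemonday.StrictPolicy().Sanitize as in the example above.

```go
package main

import (
	"fmt"
	"regexp"
)

// blockTag matches the start of an opening or closing tag for a handful of
// elements that browsers render as blocks by default. The list is an
// assumption for illustration and is not exhaustive.
var blockTag = regexp.MustCompile(`(?i)</?(?:p|div|br|li|ul|ol|h[1-6]|tr|td|th|table|blockquote|pre)\b`)

// spaceBlockTags (hypothetical helper) inserts a space in front of every
// opening or closing block-level tag, so that the text of adjacent blocks
// stays separated once the tags are stripped.
func spaceBlockTags(html string) string {
	return blockTag.ReplaceAllStringFunc(html, func(tag string) string {
		return " " + tag
	})
}

func main() {
	userInput := "<p>Why oh why</p><p>she swallowed a fly</p>"
	fmt.Println(spaceBlockTags(userInput))
	// Prints: " <p>Why oh why </p> <p>she swallowed a fly </p>" (without the quotes)
}
```

Stripping the tags afterwards leaves some leading, trailing, and doubled spaces, which is usually harmless for full-text indexing; strings.Fields plus strings.Join can collapse them if needed.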

crantok commented 8 years ago

Just updated and tested in my own code. I like the way you reduced my suggestion to the simplest possible feature.

Thank you :)

grafana-dee commented 8 years ago

> I like the way you reduced my suggestion to the simplest possible feature.

I figure that:

Plus... I'm lazy :)

crantok commented 8 years ago

Awesome :)

alltom commented 7 years ago

I wanted this for the same reason! Thanks!

For the purposes of indexing, it's a little unfortunate that AddSpaceWhenStrippingTag(true) also inserts spaces when it removes inline tags. So sanitizing <div>Go with<em>out</em></div><div>me</div> yields "Go with out me" instead of "Go without me".

Not a blocker for me, but thought I'd point it out. :)

dmitshur commented 7 years ago

It's probably not possible to know what is an inline tag in the general case, unfortunately. Even <em> can be rendered as a block, or visually separated from its neighbors, if the CSS includes em { display: block; } or em { padding: 20px; }.
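Since the library cannot know which tags a given stylesheet separates visually, one option is to let the caller decide. The sketch below is a standard-library-only illustration of that idea; padTags is a hypothetical helper, not a bluemonday API. It pads only the tags the caller names, and the result would then be stripped with StrictPolicy() without AddSpaceWhenStrippingTag.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// padTags (hypothetical helper) inserts a space before every opening or
// closing tag whose name appears in tags. Which tags count as "separating"
// is left to the caller, because it depends on the CSS in use.
func padTags(html string, tags []string) string {
	if len(tags) == 0 {
		return html
	}
	quoted := make([]string, len(tags))
	for i, t := range tags {
		quoted[i] = regexp.QuoteMeta(t)
	}
	// \b keeps "p" from matching the start of "pre", "param", and so on.
	re := regexp.MustCompile(`(?i)</?(?:` + strings.Join(quoted, "|") + `)\b`)
	return re.ReplaceAllStringFunc(html, func(tag string) string {
		return " " + tag
	})
}

func main() {
	input := "<div>Go with<em>out</em></div><div>me</div>"

	// Default rendering: only <div> separates text, so "without" stays whole.
	fmt.Println(padTags(input, []string{"div"}))
	// Prints: " <div>Go with<em>out</em> </div> <div>me </div>" (without the quotes)

	// With CSS such as em { display: block; }, the caller can add "em".
	fmt.Println(padTags(input, []string{"div", "em"}))
	// Prints: " <div>Go with <em>out </em> </div> <div>me </div>" (without the quotes)
}
```

With only "div" in the list, stripping the tags keeps "without" as a single word while still separating it from "me".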