microcosm-cc / bluemonday

bluemonday: a fast golang HTML sanitizer (inspired by the OWASP Java HTML Sanitizer) to scrub user generated content of XSS
https://github.com/microcosm-cc/bluemonday
BSD 3-Clause "New" or "Revised" License

Double quotation marks parse error #109

Closed gozelus closed 3 years ago

gozelus commented 3 years ago

I got this code:

testStr := `<p>"你行你上"是什么逻辑</p>`
result := bluemonday.StripTagsPolicy().Sanitize(testStr)

The result is &#34;你行你上&#34;是什么逻辑, but I expected "你行你上"是什么逻辑.

gozelus commented 3 years ago

Could anyone provide some help?

MitulShah1 commented 3 years ago

@gozelus use html.UnescapeString()

testStr := `<p>"你行你上"是什么逻辑</p>`
result := html.UnescapeString(bluemonday.StripTagsPolicy().Sanitize(testStr))

buro9 commented 3 years ago

Well, no :)

HTML entities are escaped by default, as they are in the Go team's core HTML package, precisely because this is the way to render the content safely in a browser.

This is not a bug, it is a feature.

But running potentially dangerous input through bluemonday and then unescaping it afterwards... you may as well not use bluemonday. The failure mode for most HTML sanitizers when the input is meaningless to the parser is to treat it as text and let it pass through, but escaped to make it harmless. If you unescape, then the default failure mode becomes weaponized.

This is an HTML sanitizer, and anything provided as input will be escaped as HTML. It's not a text sanitizer (which would make sense, as text wouldn't be rendered as HTML or would itself be escaped by the Go templating system).

Hmm... that's a point: is the problem that you then use Go templating and the output ends up double-escaped? If so, the solution is to disable escaping at the point of use within the Go template, because escaping has already been done by bluemonday.

MitulShah1 commented 3 years ago

@buro9, yes, you are correct. But what if we want to sanitise a REST API request that is in JSON format? I face the same issue there.

{"xyz":123} is converted to {&#34;xyz&#34;:123}, which then fails JSON decoding. What is the solution here?

buro9 commented 3 years ago

This is an HTML sanitizer, not a JSON sanitizer. For JSON, I would suggest that you have a few options.

One option is that you use something like https://json-schema.org/ to validate and verify that the JSON is valid for your endpoint and that the data types are correct, etc.
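The standard library has no JSON Schema validator, so the following is not JSON Schema. It is a sketch of a lighter, related technique: strict decoding into a typed struct, which rejects malformed JSON, wrong data types, and unknown properties. The request type, its required xyz field, and the decodeStrict helper are hypothetical:

```go
package main

import (
	"bytes"
	"encoding/json"
	"errors"
	"fmt"
)

// request is a hypothetical payload shape for this endpoint.
// A pointer lets us tell "field missing" apart from "field is 0".
type request struct {
	XYZ *int `json:"xyz"`
}

// decodeStrict rejects invalid JSON, wrong types, unknown fields,
// and a missing xyz property.
func decodeStrict(raw []byte) (request, error) {
	var req request
	dec := json.NewDecoder(bytes.NewReader(raw))
	dec.DisallowUnknownFields()
	if err := dec.Decode(&req); err != nil {
		return request{}, err
	}
	if req.XYZ == nil {
		return request{}, errors.New("missing required field: xyz")
	}
	return req, nil
}

func main() {
	for _, raw := range []string{
		`{"xyz":123}`,             // valid
		`{"xyz":"abc"}`,           // wrong type
		`{"xyz":1,"extra":true}`,  // unknown property
		`{&#34;xyz&#34;:123}`,     // HTML-escaped document, no longer JSON
	} {
		_, err := decodeStrict([]byte(raw))
		fmt.Printf("%s -> error: %v\n", raw, err)
	}
}
```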

Another option, if you intend to put HTML within JSON as property values, is to use bluemonday to sanitize only those values rather than the whole JSON document. Bluemonday is fast enough that even if your JSON contains multiple strings of HTML, this should not add noticeable load (at extreme volumes, if you are scaling vertically rather than horizontally, it would be noticeable, but at that scale the JSON deserialisation is noticeable too, so you would not be using JSON).

I'm going to close this issue: bluemonday is an HTML sanitizer, and the scenario here is beyond the scope of this project alone.