mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License
1.52k stars 198 forks source link

&quot getting rendered as ", without a semicolon #362

Closed tyler-shuhnicki closed 2 years ago

tyler-shuhnicki commented 2 years ago

Hello!

We recently began sanitize sitewide of all requests and had a very strange production issue. We sanitize all incoming query string parameters, and only one our APIs began suddenly failing.

After many red herrings, the issue was discovered: our query string had multiple parameters, and the parameter quoteId was behind an &. See screenshot below for the translation found in our logs:

image

In the short term, we've capitalize the Q which solves the problem, but I believe &quot should not be rendered as " if there is no semicolon since the proper HTML entity is &quot ; (needed to add a space - Github renders it as " otherwise!)

Thanks - this project has been very helpful!

mganss commented 2 years ago

I believe this is expected behavior. While it's true that the missing semicolon causes a parse error, according to the spec the parser behaves as if the semicolon was present:

This error occurs if the parser encounters a character reference that is not terminated by a U+003B (;) code point. Usually the parser behaves as if character reference is terminated by the U+003B (;) code point

The browsers I've tested behave the same way. Given the following HTML browsers show A " B:

<!DOCTYPE html>
<html>
<body>
    <p>A &quot B</p>
</body>
</html>

What's your use case? Is the query string you're sanitizing part of an HTTP request you're handling?

tyler-shuhnicki commented 2 years ago

Interesting - that's unexpected on our end, but if browsers follow the same spec, there's little to argue.

Yes - we've added sanatization in our middleware that strips out any XSS in the request path, body, and query string to further hardern our API. This was obviously a very unexpected edge case - and solved fairly easily by sending "quoteId" with a capital Q instead of its lowercase form.

mganss commented 2 years ago

IMHO this is not the ideal level to apply sanitization at. Perhaps it's better to sanitize only those strings that are intended to be rendered back to HTML. So let your server side code decode the query string and apply validation, then sanitize only the relevant parameters that are supposed to be rendered back out later (like user comments etc). This would prevent the issue you have encountered. For example:

?quoteId=1234&comment=%3Cb%3Etest%3C%2Fb%3E%3Cscript%3Exyz%3C%2Fscript%3E

First, decode the query string into (pseudo code)

var quoteId = 1234;
var comment = "<b>test</b><script>xyz</script>";

Then sanitize only comment.