mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License

Href being removed from anchor tags #242

adamhalesworth closed this issue 4 years ago

adamhalesworth commented 4 years ago

Using the default configuration, href attributes are being removed from <a> tags:

<p>Testing <a href=\"https://www.google.com\" target=\"_blank\">a link</a></p>

Becomes:

<p>Testing <a target="\&quot;_blank\&quot;">a link</a></p>

The URL uses one of the default allowed schemes (https), and RemovingAttribute doesn't fire during sanitization either, so I'm not sure why this is happening.

mganss commented 4 years ago

There's a backslash preceding each double quote, so the attribute values are parsed as unquoted and the \" becomes part of the URL. This means the URL doesn't start with the https scheme, and the attribute gets removed. RemovingAttribute does fire for me.
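
To illustrate why a leading \" defeats a scheme check, here is a minimal Python sketch. It is not HtmlSanitizer's actual code (the library is written in C#). The ALLOWED_SCHEMES set and the has_allowed_scheme helper are hypothetical names made up for this example, and urlsplit stands in for whatever URL parsing the library really performs.

```python
from urllib.parse import urlsplit

# Assumed allow-list for illustration only; not the library's real default list.
ALLOWED_SCHEMES = {"http", "https"}


def has_allowed_scheme(value: str) -> bool:
    """Return True if the attribute value parses with an allowed URL scheme."""
    return urlsplit(value).scheme in ALLOWED_SCHEMES


# Value as it looks when the markup is properly quoted.
clean = 'https://www.google.com'

# Value as it looks when the backslash-quote pairs end up inside the attribute.
escaped = '\\"https://www.google.com\\"'

print(has_allowed_scheme(clean))    # scheme is recognized as https
print(has_allowed_scheme(escaped))  # no valid scheme is found before the colon
```

The second value starts with characters that are not legal in a URL scheme, so no scheme is recognized and a scheme-based allow-list rejects it.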

adamhalesworth commented 4 years ago

Aha, for some reason I didn't think that would matter! The HTML is coming in as part of a JSON payload, so I'll add a step to clean it up first before sanitizing.

Thanks for taking the time to respond. Can we keep this open a bit longer while I get it working?

mganss commented 4 years ago

Are you sure you're parsing the JSON correctly? The backslash is an escape character in JSON, so it looks like the string might not have been properly decoded. If it has been properly decoded, then there might be a double encoding issue at the other end. The latter would mean the raw JSON has an escaped backslash before each escaped double quote (href=\\\" etc.).

Sure, we can leave this issue open for a while.
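
The two cases described above (a payload encoded once versus one encoded twice) can be sketched in Python as follows. This is only an illustration under assumptions: the "html" field name is hypothetical, and the real payload and the C# JSON parser in use are not shown in this thread.

```python
import json

# Hypothetical payload field name ("html") chosen for this illustration.
html = '<p>Testing <a href="https://www.google.com" target="_blank">a link</a></p>'

# Encoded once: each double quote appears as \" in the raw JSON text,
# and a single decode restores the original markup.
raw_once = json.dumps({"html": html})
decoded_once = json.loads(raw_once)["html"]

# Encoded twice: the value was already escaped before serialization,
# so the backslashes themselves get escaped in the raw JSON text.
pre_escaped = html.replace('"', '\\"')
raw_twice = json.dumps({"html": pre_escaped})
decoded_twice = json.loads(raw_twice)["html"]

print(decoded_once == html)    # True: safe to pass to the sanitizer
print('\\"' in decoded_twice)  # True: stray backslashes survive one decode
```

If the decoded string still contains \" sequences, the cleaner fix is usually at the producer that escaped the value twice, not a search-and-replace before sanitizing.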

adamhalesworth commented 4 years ago

I've managed to get this working and can confirm that RemovingAttribute now fires as expected. Thanks for pointing me in the right direction. I wouldn't have considered those escape characters to be an issue, but I've learned my lesson 👍