Partial HTML entity being decoded

sipsorcery commented 1 year ago

Hi,

I'm using the snippet below to check HTTP POST strings for potential XSS attacks.

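public static bool IsSafe(string raw)
{
    var sanitizer = new HtmlSanitizer();

    // URL-decode the POST body, sanitize it, then HTML-decode the sanitizer's
    // output so it can be compared with the URL-decoded input.
    var urlDecodedBody = WebUtility.UrlDecode(raw);
    var sanitized = WebUtility.HtmlDecode(sanitizer.Sanitize(urlDecodedBody, outputFormatter: new HtmlFormatter()));

    // Any difference (after normalising line endings) means the sanitizer
    // removed or altered something, so the body is treated as unsafe.
    return urlDecodedBody.Replace("\r\n", "\n") == sanitized;
}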

It works well, but today I hit a snag where a partial HTML entity was decoded as an HTML entity and thereby triggered the check.

With an input of:

merchantID=6f80138d-870b-4b07-8bc4-a4fd33a0d30f&currency=GBP&accountName=Curl%203

the sanitized value is:

merchantID=6f80138d-870b-4b07-8bc4-a4fd33a0d30f¤cy=GBP&accountName=Curl 3

So &curren is being converted to ¤. Is that correct behaviour? Shouldn't it need to be &curren; (with terminating semi-colon) to be treated as an HTML entity?

sipsorcery commented 1 year ago

In answer to my own question, it does seem to be correct behaviour.

Chrome renders <p>I will display &curren</p> as I will display ¤.

A follow-up question: is there an option to turn off HTML entity decoding with the HtmlFormatter? Or is there a better approach?

sipsorcery commented 1 year ago

Turning off HTML entity parsing doesn't seem to be an option: https://github.com/mganss/HtmlSanitizer/issues/62.
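
One possible workaround, sketched here under the assumption that the body is always application/x-www-form-urlencoded, is to split the body into key/value pairs first and check each decoded piece separately, so the & separators between fields never reach the HTML parser. The names FormBodyCheck, ParseFormBody and the isTextSafe delegate are hypothetical and not part of HtmlSanitizer; isTextSafe stands in for whatever per-string check is used, such as the sanitize-and-compare logic from the original snippet.

```csharp
using System;
using System.Collections.Generic;
using System.Net;

public static class FormBodyCheck
{
    // Splits a form-encoded body on '&' and '=' BEFORE URL-decoding,
    // then decodes each key and value on its own.
    public static IEnumerable<KeyValuePair<string, string>> ParseFormBody(string raw)
    {
        var pairs = raw.Split(new[] { '&' }, StringSplitOptions.RemoveEmptyEntries);
        foreach (var pair in pairs)
        {
            var idx = pair.IndexOf('=');
            var key = idx < 0 ? pair : pair.Substring(0, idx);
            var value = idx < 0 ? string.Empty : pair.Substring(idx + 1);
            yield return new KeyValuePair<string, string>(
                WebUtility.UrlDecode(key), WebUtility.UrlDecode(value));
        }
    }

    // isTextSafe is a hypothetical per-string check supplied by the caller.
    public static bool IsSafe(string raw, Func<string, bool> isTextSafe)
    {
        foreach (var field in ParseFormBody(raw))
        {
            if (!isTextSafe(field.Key) || !isTextSafe(field.Value))
            {
                return false;
            }
        }

        return true;
    }
}
```

With the input above, the checked strings would be merchantID, currency, accountName and their values, none of which starts with &. A value that itself contains a literal &curren (for example sent as %26curren) would still be decoded by the HTML parser, so this only avoids the false positive caused by the field separator.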

mganss commented 1 year ago

See #362

mganss / HtmlSanitizer

Partial HTML entity being decoded #414