mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License

Error on sanitizing simple post without any invalid char. #534

Closed edika99 closed 4 months ago

edika99 commented 4 months ago

I'm using the tool to sanitize form input posted to a Razor page. I've set up a middleware that analyzes the posted data and returns an error if the sanitized text differs from the request body. Here is the code I use:

httpContext.Request.EnableBuffering();

using (var streamReader = new StreamReader(httpContext.Request.Body, Encoding.UTF8, leaveOpen: true))
{
    var raw = await streamReader.ReadToEndAsync();
    var sanitizer = new HtmlSanitizer();
    var sanitised = sanitizer.Sanitize(raw.ToString());

    if (!raw.Equals(sanitised))
    {
        await RespondWithAnError(httpContext).ConfigureAwait(false);
        return;
    }
}
httpContext.Request.Body.Seek(0, SeekOrigin.Begin);
await _next.Invoke(httpContext);

In my form the raw string is like this:

gtoken=03AFcWeA6iBz57f-Yhnuw19PKO82S_XfjdBXJn8Ymn_eLwNGv5EQLvwnMzqEq-LnqQC3uEUatdzCtWmFfUa4nEtrAVuTI3GqhiFyQZoWDDmkVJc2d_93fYY9YVZQoqnh7xLSW_20C-jqP7J4Wavmln5I1IIcAHjMrUOep63f3xal1Uk6jr6GDYFbENyL1wsao4i1ZGXekx0YBmAScvTcNeUjOpRSr_wCfwadzeEqCmslwad9Gqwh0FzuO8eCw0uSTw8zTS2FQblN9kBFCTCwKJUcC57KISWybaihVqJARh5dOnzScLIUVRMPBJ7eM-_an-StrXtQOfYnE-HEFGGhHWWEFWyXu6aBmugcWnK-RySv7mnStwJXu7DfhkKokL-5dowT4XZwZxlWhIdJ_wOqpg5ebZnJjf_qjeuez2zZIjzaNenE9maezyhs1xF5XmI3dKDETf3Il8XH6U6Ddui2Pzfse2wzTWpYiRPWYLC8le_kGiFc_xbOh8dzNFPC36Q7dqMokwMF_-8XX7pGlOa3DgKjvXtZQY1WTUUnuBRIBKaB0wjjhaxvoHH9Mr_g0EGmu4iqZsr5r86VmGMnBERwBjyy1AJRXLUpMP1xSR14v6Hoo9u2PIRg2trL-pkDFAzb4MpfzW4_bfA_--OOuJYx9DP0jfHBj8sBUDygwFSV6wW3u1-VwDTPz4Pdg
cbxday=0
cbxloc=0
cbxro=0
__RequestVerificationToken=CfDJ8BYucvB9EKBOrcxloaLPaJjJIRziyc6gXvXIepgMN_6DJVDTiQYmW-lKvbev-kq6uteAkdxPV3cjFW2HwbrEuQS0b-PZj1_D4gdUWMa3mQ5A6R2qu0tkxnJDHr7LTcrcJNC2TR5HZowiOy4KLsItoFhFTeVY6WU04OWblFrFiu4ifSTm6NyFMyco1J9RQWZw4g

When debugging the application, 'raw' and 'sanitised' are not equal even though, to the human eye, the two strings look identical. I copied the values of 'raw' and 'sanitised' into two files and compared them with WinMerge, which reports them as identical. Even so, comparing the two strings with 'raw!=sanitised' returns true. When I manually paste the given string into the .NET Fiddle you suggested for testing HtmlSanitizer, the two strings are equal. So there must be some non-printable character that is stripped by the sanitizer but that does not show up in the output of the .ToString() method.

How can I find the changes made to the input data, and why are the two strings considered different?

mganss commented 4 months ago

You can try to write a for loop comparing each character of both strings.
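A minimal sketch of such a loop, using only the base class library. The names `StringDiff`, `FirstDifference`, and `Describe` are made up for illustration, and the sample values (a zero-width space hidden in one string) are placeholders, not the data from this issue.

```csharp
using System;

static class StringDiff
{
    // Returns the index of the first differing UTF-16 code unit,
    // or -1 if both strings are identical.
    public static int FirstDifference(string a, string b)
    {
        int shared = Math.Min(a.Length, b.Length);
        for (int i = 0; i < shared; i++)
        {
            if (a[i] != b[i])
            {
                return i;
            }
        }
        // Same prefix: the strings differ only if one is longer than the other.
        return a.Length == b.Length ? -1 : shared;
    }

    // Formats the code unit at the given index as U+XXXX so that
    // invisible characters become visible.
    public static string Describe(string s, int index)
    {
        if (index >= s.Length)
        {
            return "<end of string>";
        }
        return $"U+{(int)s[index]:X4}";
    }
}

class Program
{
    static void Main()
    {
        // Placeholder values: 'raw' contains an invisible zero-width space.
        string raw = "abc\u200Bdef";
        string sanitised = "abcdef";

        int i = StringDiff.FirstDifference(raw, sanitised);
        if (i < 0)
        {
            Console.WriteLine("identical");
        }
        else
        {
            // Prints: index 3: raw U+200B, sanitised U+0064
            Console.WriteLine(
                $"index {i}: raw {StringDiff.Describe(raw, i)}, " +
                $"sanitised {StringDiff.Describe(sanitised, i)}");
        }
    }
}
```

Printing the numeric code unit rather than the character itself is the point here: a diff tool or debugger view may render two different characters identically, or not render one at all.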

I have a feeling, though, you might not be using HtmlSanitizer at the intended level of abstraction. HtmlSanitizer is exclusively targeted at sanitizing HTML and in the above code it seems you're trying to sanitize the body of an HTTP POST request.

Assuming one of the posted values can contain HTML I would suggest first decoding the form values into key value pairs and then sanitizing only the values that may contain HTML.
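One way this suggestion could look, as a sketch rather than a definitive implementation. It assumes the body is `application/x-www-form-urlencoded` and decodes it with `System.Web.HttpUtility.ParseQueryString`. The names `FormBodyCheck` and `FindModifiedFields`, the field names, and the sample body are hypothetical. The sanitizer is passed in as a delegate so the example runs without the library; in real code it could be `html => sanitizer.Sanitize(html)` on an `HtmlSanitizer` instance.

```csharp
using System;
using System.Collections.Generic;
using System.Collections.Specialized;
using System.Web;

static class FormBodyCheck
{
    // Decodes a form-urlencoded body and returns the names of the listed
    // fields whose decoded value is changed by the supplied sanitizer.
    // Fields not listed in htmlFields (tokens, checkboxes, ...) are skipped.
    public static List<string> FindModifiedFields(
        string body, ISet<string> htmlFields, Func<string, string> sanitize)
    {
        // ParseQueryString splits on '&' and '=' and URL-decodes names and values.
        NameValueCollection form = HttpUtility.ParseQueryString(body);
        var modified = new List<string>();

        foreach (string key in form.AllKeys)
        {
            if (key == null || !htmlFields.Contains(key))
            {
                continue;
            }
            // Repeated keys are returned as one comma-joined value.
            string value = form[key] ?? string.Empty;
            if (!string.Equals(value, sanitize(value), StringComparison.Ordinal))
            {
                modified.Add(key);
            }
        }
        return modified;
    }
}

class Program
{
    static void Main()
    {
        // Hypothetical body: only "comment" is expected to carry HTML.
        string body = "comment=%3Cscript%3Ex%3C%2Fscript%3Ehello&cbxday=0&token=abc-_123";
        var htmlFields = new HashSet<string> { "comment" };

        // Stand-in for a real sanitizer, for demonstration only.
        Func<string, string> sanitize = s => s.Replace("<script>x</script>", "");

        List<string> modified = FormBodyCheck.FindModifiedFields(body, htmlFields, sanitize);
        Console.WriteLine(string.Join(",", modified)); // prints "comment"
    }
}
```

Comparing decoded values field by field also reports which field was altered, which a comparison of the whole body cannot do.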

edika99 commented 4 months ago

Yes, I'm using HtmlSanitizer to check for invalid posted data in some forms. I want to prevent HTML/JS code from being posted. It is part of an anti-XSS checker. Anyway, it works well, and I had not encountered any problem until this request. None of the values contain HTML, and the only values that change are the __RequestVerificationToken and gtoken. It would be useful to have a property or a method that returns only the data changed or removed by the Sanitize() method, so that the invalid data can be inspected and analyzed and the sanitizer configured better. In this case the changes made by the Sanitize() method are not visible to a human: the two strings look identical, but they are not equal when compared at runtime.

tiesont commented 4 months ago

In my experience, strings that look the same but fail a comparison usually turn out to differ in character encoding.

As @mganss points out, though, you're not really using HtmlSanitizer in the correct context: it's not intended to be used for parsing a raw request body.

If you could attach an example that demonstrates what you're seeing, I could take a look, but I doubt this is an issue with this library.