Can't sanitize full documents

tl24 commented 8 years ago

We're trying to sanitize full html documents and we're losing the outer parts of the document. From the looks of the code it is automatically wrapping the html text in a body tag before passing it through. Is there any way around this?

We're loading html emails into a browser window which is why they are full documents. We put them into an iframe so they don't mess with the surrounding page. The iframe is sandboxed, but it would be nice to have the peace of mind of knowing we tried to sanitize the html as well.

304NotModified commented 8 years ago

NB: I consider this as a bug.

mganss commented 8 years ago

This is similar to #58. Currently, we cannot sanitize full documents, only body fragments. This has to do with the way the AngleSharp HTML parser works. I agree that this is a bug that needs to be fixed. I'm not sure if this can be done w/o breaking the current API.

304NotModified commented 8 years ago

We're trying to sanitize full html documents and we're losing the outer parts of the document. From the looks of the code it is automatically wrapping the html text in a body tag before passing it through. Is there any way around this?

Isn't this change as unintended side-effects? We can fix this? (or make it optional at least?)

mganss commented 8 years ago

If you parse <script>xyz</script> the parser constructs the following document:

<html><head><script>alert('Hello world!')</script></head><body></body></html>

If you parse <div>xyz</div><script>xyz</script> the result is

<html><head></head><body><div>xyz</div><script>xyz</script></body></html>

Previous to #58, we lost the script node in the first example because it was parsed into the head and we were rendering only the body to avoid getting back the generated "boilerplate" that was not in the input. The fix for #58 was to wrap the input in a body element to force all input into the body.

One possible solution I see is to have an additional SanitizeDocument() method that doesn't add a body tag and renders out the whole document (not only the body). We'd also have to add at least html, head, body to the list of tags allowed by default.

Thoughts?

tiesont commented 8 years ago

Sanitizing full documents seems like asking for trouble, or at the very least adding a fair amount of complexity.

For instance, this is a completely safe bit of markup:

<script type="text/template">
    <p>I'm a template! Woooooooo!</p>
</script>

Because it's a <script> tag, it gets removed by default. You can't simply add <script> as an allowed tag, since you then lose most of the purpose of the library. To allow those template scripts, you have to subscribe to the RemovingTag event and check for the text/template type attribute.

This is obviously only one example, and it becomes a moot argument when <template> has better support, but it seemed worth chiming in...

304NotModified commented 8 years ago

One possible solution I see is to have an additional SanitizeDocument() method that doesn't add a body tag and renders out the whole document (not only the body). We'd also have to add at least html, head, body to the list of tags allowed by default.

I think this is a good solution.

@tiesont good example, but I don't see any trouble with it. Just subscribe to the events and it did work.

mganss commented 8 years ago

I've implemented this on the document branch and made a beta release on NuGet.

@tiesont Don't get me wrong: I have no intention to allow script tags in the head (nor body), I just want to give users the chance to allow them according to their own rules. Previously, when you had this:

<html><head><title>Test</title></head><body><div>Test</div></body></html>

you'd get back <div>Test</div>. Now you can pass this into SanitizeDocument() and you'll get back

<html><head></head><body><div>Test</div></body></html>

I didn't even allow the <title> tag by default (should we? if so, which other ones that are only relevant in the head?).

304NotModified commented 8 years ago

I didn't even allow the <title> tag by default (should we? if so, which other ones that are only relevant in the head?).

JavaScript execution via <TITLE> tag on Inernet Explorer 9

Internet Explorer 9 allows execution of JavaScript via onpropertychange event handler on <title> tags if another <title> tag follows up - having at least one valid attribute. This vector works in IE6-8 Standards mode and in IE9 quirks mode.

https://html5sec.org/#107

mganss commented 8 years ago

JavaScript execution via <TITLE> tag on Inernet Explorer 9

Yeah, but it's only due to the onpropertychange attribute not the title tag per se. Anyway, I'll leave it as is for now (not allowing <title> nor any other additional tags besides html, head, and body).

@tl24 Does the beta work for your use case?

tl24 commented 8 years ago

I'll try and pull down the beta today and test.

Thanks!

tl24 commented 8 years ago

Ok, tried it out and it works for me. I had to add a few things, which I'll post here in case you want to make them part of the defaults:

            sanitizer.AllowedAttributes.Add("class");
            sanitizer.AllowedAttributes.Add("id");
            sanitizer.AllowedTags.Add("style");
            sanitizer.AllowedCssProperties.Add("text-overflow");

When will this be in a non-beta release to nuget?

Thanks!

mganss commented 8 years ago

I've released a new stable version to NuGet. I won't add more tags and attributes to the defaults for now. Careful: whole style sheets within <style> will not be sanitized currently, only inline styles.

mganss commented 8 years ago

Oops, 3.2.100 wasn't actually marked as stable on NuGet :blush: 3.2.103 now is. I've also added sanitization of style sheets (allowed at-rules configurable).

mganss / HtmlSanitizer

Can't sanitize full documents #63