Closed tl24 closed 8 years ago
NB: I consider this a bug.
This is similar to #58. Currently, we cannot sanitize full documents, only body fragments. This has to do with the way the AngleSharp HTML parser works. I agree that this is a bug that needs to be fixed. I'm not sure if this can be done w/o breaking the current API.
We're trying to sanitize full html documents and we're losing the outer parts of the document. From the looks of the code it is automatically wrapping the html text in a body tag before passing it through. Is there any way around this?
Doesn't this change have unintended side effects? Can we fix this (or at least make it optional)?
If you parse <script>xyz</script>
the parser constructs the following document:
<html><head><script>xyz</script></head><body></body></html>
If you parse <div>xyz</div><script>xyz</script>
the result is
<html><head></head><body><div>xyz</div><script>xyz</script></body></html>
Previous to #58, we lost the script node in the first example because it was parsed into the head and we were rendering only the body to avoid getting back the generated "boilerplate" that was not in the input. The fix for #58 was to wrap the input in a body element to force all input into the body.
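The placement described above can be reproduced with AngleSharp directly. This is a minimal sketch, assuming AngleSharp 0.10 or later (where `HtmlParser` lives in `AngleSharp.Html.Parser` and exposes `ParseDocument`); older releases used different names. The class `ParsePlacementExample` and its `Split` helper are hypothetical, and the plain `<body>` prefix only illustrates the idea behind the #58 fix, not the exact wrapper the library uses.

```csharp
using System;
using AngleSharp.Html.Parser; // assumption: AngleSharp 0.10+ namespace layout

public static class ParsePlacementExample
{
    // Returns the markup the parser placed inside <head> and inside <body>.
    public static (string Head, string Body) Split(string html)
    {
        var document = new HtmlParser().ParseDocument(html);
        return (document.Head.InnerHtml, document.Body.InnerHtml);
    }

    public static void Main()
    {
        // A leading <script> is parsed into the head, so rendering only the body loses it.
        var scriptOnly = Split("<script>xyz</script>");
        Console.WriteLine($"head: [{scriptOnly.Head}] body: [{scriptOnly.Body}]");

        // Once body content has started, a later <script> stays in the body.
        var mixed = Split("<div>xyz</div><script>xyz</script>");
        Console.WriteLine($"head: [{mixed.Head}] body: [{mixed.Body}]");

        // Opening the body explicitly forces the script into the body as well.
        var wrapped = Split("<body><script>xyz</script>");
        Console.WriteLine($"head: [{wrapped.Head}] body: [{wrapped.Body}]");
    }
}
```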
One possible solution I see is to have an additional SanitizeDocument() method that doesn't add a body tag and renders out the whole document (not only the body). We'd also have to add at least html, head, body to the list of tags allowed by default.
Thoughts?
Sanitizing full documents seems like asking for trouble, or at the very least adding a fair amount of complexity.
For instance, this is a completely safe bit of markup:
<script type="text/template">
<p>I'm a template! Woooooooo!</p>
</script>
Because it's a <script> tag, it gets removed by default. You can't simply add <script> as an allowed tag, since you then lose most of the purpose of the library. To allow those template scripts, you have to subscribe to the RemovingTag event and check for the text/template type attribute.
This is obviously only one example, and it becomes a moot argument when <template> has better support, but it seemed worth chiming in...
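A minimal sketch of the event-based approach described above, assuming a recent HtmlSanitizer release where the namespace is `Ganss.Xss` (releases from the time of this thread used `Ganss.XSS`). The class `TemplateScriptExample` and the method `SanitizeKeepingTemplates` are hypothetical names. It cancels removal only for `<script>` elements whose `type` is `text/template`, so ordinary scripts are still stripped.

```csharp
using System;
using Ganss.Xss; // assumption: newer namespace; 3.x releases used Ganss.XSS

public static class TemplateScriptExample
{
    public static string SanitizeKeepingTemplates(string html)
    {
        var sanitizer = new HtmlSanitizer();

        // RemovingTag fires before a disallowed tag is removed; Cancel keeps the tag.
        sanitizer.RemovingTag += (sender, e) =>
        {
            bool isScript = string.Equals(
                e.Tag.LocalName, "script", StringComparison.OrdinalIgnoreCase);
            bool isTemplate = string.Equals(
                e.Tag.GetAttribute("type"), "text/template", StringComparison.OrdinalIgnoreCase);

            if (isScript && isTemplate)
            {
                e.Cancel = true;
            }
        };

        return sanitizer.Sanitize(html);
    }

    public static void Main()
    {
        // Template script: removal is cancelled by the handler.
        Console.WriteLine(SanitizeKeepingTemplates(
            "<script type=\"text/template\"><p>I'm a template!</p></script>"));

        // Ordinary script: the handler does not cancel, so it is removed as usual.
        Console.WriteLine(SanitizeKeepingTemplates("<script>alert(1)</script>"));
    }
}
```

Whether the kept element's attributes survive depends on the attribute allow list, so it is worth checking the output for your own templates.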
One possible solution I see is to have an additional SanitizeDocument() method that doesn't add a body tag and renders out the whole document (not only the body). We'd also have to add at least html, head, body to the list of tags allowed by default.
I think this is a good solution.
@tiesont good example, but I don't see any trouble with it. Just subscribe to the events and it works.
I've implemented this on the document branch and made a beta release on NuGet.
@tiesont Don't get me wrong: I have no intention to allow script tags in the head (nor body), I just want to give users the chance to allow them according to their own rules. Previously, when you had this:
<html><head><title>Test</title></head><body><div>Test</div></body></html>
you'd get back <div>Test</div>. Now you can pass this into SanitizeDocument() and you'll get back
<html><head></head><body><div>Test</div></body></html>
I didn't even allow the <title> tag by default (should we? if so, which other ones that are only relevant in the head?).
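The difference between the two entry points can be sketched as follows, assuming a release that includes `SanitizeDocument()` and the newer `Ganss.Xss` namespace (3.x releases used `Ganss.XSS`). The class name `SanitizeDocumentExample` and its helper methods are hypothetical; the input is the one from the comment above.

```csharp
using System;
using Ganss.Xss; // assumption: newer namespace; 3.x releases used Ganss.XSS

public static class SanitizeDocumentExample
{
    public const string Input =
        "<html><head><title>Test</title></head><body><div>Test</div></body></html>";

    // Fragment mode: the input is forced into a body and only the body content is rendered.
    public static string AsFragment() => new HtmlSanitizer().Sanitize(Input);

    // Document mode: the whole document, including html, head, and body, is rendered.
    public static string AsDocument() => new HtmlSanitizer().SanitizeDocument(Input);

    public static void Main()
    {
        Console.WriteLine(AsFragment());
        Console.WriteLine(AsDocument());
    }
}
```

Per the comment above, `<title>` is not allowed by default, so it is dropped in both modes unless it is added to `AllowedTags`.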
I didn't even allow the <title> tag by default (should we? if so, which other ones that are only relevant in the head?).
JavaScript execution via <TITLE> tag on Internet Explorer 9
Internet Explorer 9 allows execution of JavaScript via onpropertychange event handler on <title> tags if another <title> tag follows up - having at least one valid attribute. This vector works in IE6-8 Standards mode and in IE9 quirks mode.
JavaScript execution via <TITLE> tag on Internet Explorer 9
Yeah, but it's only due to the onpropertychange attribute, not the title tag per se. Anyway, I'll leave it as is for now (not allowing <title> nor any other additional tags besides html, head, and body).
@tl24 Does the beta work for your use case?
I'll try and pull down the beta today and test.
Thanks!
Ok, tried it out and it works for me. I had to add a few things, which I'll post here in case you want to make them part of the defaults:
sanitizer.AllowedAttributes.Add("class");
sanitizer.AllowedAttributes.Add("id");
sanitizer.AllowedTags.Add("style");
sanitizer.AllowedCssProperties.Add("text-overflow");
When will this be in a non-beta release on NuGet?
Thanks!
I've released a new stable version to NuGet. I won't add more tags and attributes to the defaults for now. Careful: whole style sheets within <style> will not be sanitized currently, only inline styles.
Oops, 3.2.100 wasn't actually marked as stable on NuGet :blush: 3.2.103 now is. I've also added sanitization of style sheets (allowed at-rules configurable).
We're loading html emails into a browser window which is why they are full documents. We put them into an iframe so they don't mess with the surrounding page. The iframe is sandboxed, but it would be nice to have the peace of mind of knowing we tried to sanitize the html as well.