Strange sanitizing of <style>, <title> and <iframe> tags

andrewQwer commented 7 months ago

Hi, we are trying to sanitize our user input and want to erase all the tags keeping other text unchanged. But sanitizer behaves differently depending on what tag goes first.

Class initialization looks like this:

sanitizer = new HtmlSanitizer { KeepChildNodes = true };
sanitizer .AllowDataAttributes = false;
sanitizer .AllowedAtRules.Clear();
sanitizer .AllowedAttributes.Clear();
sanitizer .AllowedCssProperties.Clear();
sanitizer .AllowedSchemes.Clear();
sanitizer .AllowedTags.Clear();
sanitizer .UriAttributes.Clear();

In all examples the same Sanitize method is called:

some text <div> <img> text continuation - this in an input string. Output is -> some text text continuation - this is what we actually expect from the library.
some text <style> <img> text continuation Output is -> some text <img> text continuation - image tag remained for some reason. Same behavior with title tag as well. Why? Same happens if you wrap <img> tag into style or title completely like this: some text <style> <img></style> text continuation - img tag will remain.
some text <iframe> <img> text continuation Output is -> some text &lt;img&gt; text continuation - image tag remained but was additionally escaped. Better than p.2 but still unclear why output differs from p.1.

I haven't found any other tags except style, title and iframe that would cause this strange behavior. Other tags I checked are sanitized like in p.1

Can you please advise what settings combination should be used to always have results like in p.1 or at least like in p.3?

tiesont commented 7 months ago

What you're describing in #.2 is invalid HTML - I assume AngleSharp is trying to correct for that.

andrewQwer commented 7 months ago

What you're describing in #.2 is invalid HTML - I assume AngleSharp is trying to correct for that.

The attacker doesn't care about HTML structure, he wants to break XSS protection and provided cases allow to do it more efficiently :) Would be good to know whether it's intended behavior and we should take care of thisby ourselves somehow or should it be fixed in the future?

tiesont commented 7 months ago

What you're describing in #.2 is invalid HTML - I assume AngleSharp is trying to correct for that.

The attacker doesn't care about HTML structure, he wants to break XSS protection and provided cases allow to do it more efficiently :) Would be good to know whether it's intended behavior and we should take care of thisby ourselves somehow or should it be fixed in the future?

The point was, AngleSharp is going to try to do what your browser would do, which is correct invalid markup.

The result you're getting is an encoded img tag, not an actual HTML tag, so I'm not sure what the concern is (aside from having a result that shows <img> to wherever this content is rendered).

Is there something which you think is broken here, I guess is what I'm asking?

andrewQwer commented 7 months ago

Is there something which you think is broken here, I guess is what I'm asking?

I would say that all these cases should be processed in the same way when KeepChildNodes = true, or at least very similar.

But we have 3 different results by just replacing parent tag:

some text <div> <img> text continuation - why is <img/> erased here if we keep child nodes? some text <style> <img> text continuation - not clear why in this case <img/> remains, some text <iframe> <img> text continuation - no clear why in this case <img/> remains and additionally escaped (see my first message)? Btw same works if you replace iframe with script tag, <img/> remains and additionally escaped.

We use HtmlSanitize for sanitizing user text input and such behavior creates a lot of combinations to embed XSS that won't be erased. If we set KeepChildNodes = false it works the same way for every mentioned case but it removes everything that goes after parent tag.

tiesont commented 7 months ago

<img/> doesn't remain (not as an HTML tag, anyway) - it's just HTML-encoded content.

If I had to guess, my theory would be that AngleSharp considers anything following the style or title tag as "content" (since neither allows HTML) and encodes anything that has an HTML entity equivalent. Once it's done that, HtmlSanitizer won't process the "image" as HTML, because, well, it's not - at that point, it's just text.

Not sure how this output would be an XSS vector - you'd have to decode <img> back to <img> prior to handing the content off to whatever, and that's not really in scope for something like HtmlSanitizer.

As for

why is erased here if we keep child nodes

the img element is self-terminating, so there's no child content to preserve. If you want to keep things like <br>, <hr>, and <img>, then they need to be included in the allowed tags list.

JUST TO CLARIFY: I am not a maintainer or member of this project, I just use this pretty heavily in my projects, so I try to contribute back if I can. It's entirely possible that I'm misunderstanding what you're trying to do, but I don't think so. If I am, @mganss will probably jump in at some point and can explain what's happening, better.

You might also consider posting a related question in the AngleSharp repo, since that's ultimately the driver behind what you're seeing, but if you do so, I'd suggest limiting the question to why tags within elements like <style> or <title> are escaped.

andrewQwer commented 7 months ago

Thanks for your thoughts @tiesont. You understood correctly what I tried to explain.

Not sure how this output would be an XSS vector - you'd have to decode <img> back to prior to handing the content off to whatever, and that's not really in scope for something like HtmlSanitizer.

correct, that is what we are doing due to our internal logic as 99% of app is React based (and xss safe), but for this 1% of third party libraries it doesn't work sometimes and XSS may happen due to unescape 😄

doesn't remain (not as an HTML tag, anyway) - it's just HTML-encoded content....

Your explanation makes sense except iframe case. With iframe, img tag is additionally escaped (double escaping) that is weird. So these discrepancies after sanitization make our lives more complex, as we don't consider text input as HTML but just text that an attacker can use to embed some HTML content that will be unescaped and rendered on UI later. That's why we sanitize it, but our expectation was that sanitization happens the same way for whole text no matter what HTML tricks an attacker may apply.

In general, it doesn't matter if all mentioned cases are sanitized differently until you don't unescape them back like we do.

tiesont commented 7 months ago

Yeah, <iframe> is an odd little duckling - the spec allows it to have a closing tag, so <iframe></iframe> is completely valid, but the spec also forbids it to have child content, so it's a void element like <img>. I suspect that makes it an edge case that AngleSharp handles differently. Most likely, there's probably a double-pass on the parsing of that tag, so the already-encoded img tag gets encoded again (so the ampersands are converted to &).

Again, just conjecture based on past observed behavior (although my use-cases are much simpler, so I haven't run into anything like you're describing here).

mganss commented 7 months ago

As @tiesont pointed out, elements like <style> and <title> are raw text elements whose content is not interpreted as HTML by a browser. Try the following in a browser:

var d = document.createElement("div");
d.innerHTML = '<title><img></title>';
d.innerHTML
-> '<title>&lt;img&gt;</title>'

To eliminate XSS potential surrounding text content HtmlSanitizer encodes any text content of raw text elements. There was a recent vulnerability in HtmlSanitizer that was caused by a bug in AngleSharp and allowed a bypass when foreign content elements were allowed, see https://github.com/mganss/HtmlSanitizer/security/advisories/GHSA-43cp-6p3q-2pc4

I'll have to look into the <iframe> case. I suspect there is some double encoding going on between AngleSharp and HtmlSanitizer.

mganss commented 7 months ago

The double encoding of <iframe> content occurs when the iframe element is replaced with its child nodes if KeepChildNodes is true. The encoding step of text content occurs before that so when the replace operation is done the iframe element's content has already been encoded to <img> text continuation.

As @tiesont mentioned, iframe is not permitted to have content so I wouldn't consider the current behavior a bug. OTOH I'm wondering what HtmlSanitizer should do in a case like this. Perhaps we should always remove the content of iframe elements?

tiesont commented 7 months ago

@mganss That's probably something that gets tricky - since <iframe> doesn't require a closing tag, how do you determine what's content that the author intended to be embedded in the iframe (even though it's not really allowed) versus text that just happens to follow the iframe tag, if the closing tag is omitted?

I'll do some experimenting, but do you know what happens if something invalid like <img src="image.png">I have content!</img> is passed in? I've...actually seen things like that in student work, sadly enough...

andrewQwer commented 7 months ago

Hi again, thanks for the explanation, we will try to handle it by our own, but I have a question. Is it possible to add some option like KeepTextContent=true that would keep text data stored inside a tag when KeepChildNodes = false? So in the following example some text <img>img text</img> would be resulted as some text img text? We don't mind to remove tags but we need to keep text inside it 😄 AngleSharp has something that is called TextContent so maybe it's possible to keep it when tag is removed.

mganss commented 7 months ago

@tiesont It seems <iframe> does require a closing tag:

Tag omission: None, both the starting and ending tag are mandatory.

That's why AngleSharp continues to parse what follows the opening <iframe> as its text content. Even though elements like iframe and img can't have content, the corresponding property in AngleSharp's DOM does have a value.

mganss commented 7 months ago

@andrewQwer Yes, that might be possible. Will experiment in the next few days.

andrewQwer commented 7 months ago

@andrewQwer Yes, that might be possible. Will experiment in the next few days.

This would be great and potentially allow us to use KeepChildNodes = false and keep text inside nodes.

mganss commented 7 months ago

I'm considering to change the processing order so that the encoding of text content occurs after the removal of disallowed elements. This would mean the double encoding in the iframe case would no longer occur, i.e. the output of <iframe><img></iframe> would be <iframe><img></iframe>.

Does that facilitate the use case you have in mind @andrewQwer? I'd prefer not to add a special property like KeepTextContent because that would be a breaking change if added to the IHtmlSanitizer interface.

andrewQwer commented 7 months ago

I'm considering to change the processing order so that the encoding of text content occurs after the removal of disallowed elements. This would mean the double encoding in the iframe case would no longer occur, i.e. the output of <iframe><img></iframe> would be <iframe><img></iframe>.

Does that facilitate the use case you have in mind @andrewQwer? I'd prefer not to add a special property like KeepTextContent because that would be a breaking change if added to the IHtmlSanitizer interface.

Hi @mganss. At least this behavior will follow other tags behavior. As for expectations I would better remove child nodes in all cases no matter what parent tag is, like I showed in initial message for div tag. But I understand that this could not be possible due to how AngleSharp treats different tags.

mganss commented 7 months ago

You can perform all kinds of processing using the RemovingTag event, e.g. you could replace an element with its text content, essentially the same as the KeepTextContent property you proposed. Something like this (untested):

sanitizer.RemovingTag += (s, e) =>
{
    var tag = e.Tag;
    var dom = tag.GetAncestor<IHtmlDocument>();
    if (dom != null)
    {
        var txt = tag.TextContent;
        var txtNode = dom.CreateTextNode(txt);
        tag.Replace(txtNode);
        e.Cancel = true;
    }
};

andrewQwer commented 6 months ago

You can perform all kinds of processing using the RemovingTag event, e.g. you could replace an element with its text content, essentially the same as the KeepTextContent property you proposed. Something like this (untested):
sanitizer.RemovingTag += (s, e) =>
{
    var tag = e.Tag;
    var dom = tag.GetAncestor<IHtmlDocument>();
    if (dom != null)
    {
        var txt = tag.TextContent;
        var txtNode = dom.CreateTextNode(txt);
        tag.Replace(txtNode);
        e.Cancel = true;
    }
};

Thanks! Works as expected. Just added extra Sanitize for TextContent to remove tags that remain inside style/title tags and extra unescape for the specific cases with iframe that double escapes tags inside.

mganss / HtmlSanitizer

Strange sanitizing of <style>, <title> and <iframe> tags #494