mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License
1.51k stars 198 forks source link

Configuration option to bypass changes for security advisory #511

Closed jerriep closed 6 months ago

jerriep commented 6 months ago

I want to enquire whether it would be possible to add a configuration option to bypass the encoding that was done as part of https://github.com/mganss/HtmlSanitizer/commit/ab29319866c020f0cc11e6b92228cd8039196c6e.

I have a use case where it is valid for users to pass custom content that I do not want to be encoded. In my case, I produce an HTML document like the one below:

<p>
  This is a test case to see whether we can pass raw content to a Contentful
  Long Text (markdown) field
</p>
<script data-cp-raw-block="">
  <CollapsibleSection label="Click to expand!">
  Content that was hidden before!
  </CollapsibleSection>

  <TextBox>__Title__
  Your text here.
  </TextBox>
</script>
<p>This is some text after the table</p>

The markup inside the script tag is user-generated and must not be encoded as it must be passed precisely as-is to an external system. With the changes made in the commit linked above, the custom tags inside that script element are encoded, breaking things down the line for me.

Would it be possible to add a configuration option to bypass the encoding?

I am happy to submit a PR for this.

tiesont commented 6 months ago

Curious - how does the browser treat that script tag? The default type for <script> is text/javascript (according to the MDN), so it should be throwing up errors once it encounters markup (since it wouldn't be valid JavaScript).

jerriep commented 6 months ago

It is not sent to the browser like that. The HTML document is an intermediate format for processing. It ultimately gets sent to a Content Management System (CMS), which will process it and turn it into the correct HTML.

It is also not sent to the CMS inside a script block. It put it temporarily in the script block because that is the only HTML element I found where HtmlSanitizer or AngleSharp does not alter the contents (until the change in the commit I referenced). If I were to place it in a div, for example, the contents may be rejected because, as you pointed out, this may not be valid HTML. It may not even be something remotely like HTML - it may be JSON, as per the last video linked below.

This is also why I am not concerned about someone injecting malicious code because (a) it will be processed subsequently by my app and (b) it will be sanitized correctly by the CMS it is passed to.

For background, it is related to the feature described in the videos below:

tiesont commented 6 months ago

I assumed it was something like that.

Would something like this example work for you? https://github.com/mganss/HtmlSanitizer/wiki/Examples#ex4-encode-non-html-before-sanitizing

The only other existing option I can think of would be to handle the RemovingTag event: https://github.com/mganss/HtmlSanitizer/wiki/Hooks

jerriep commented 6 months ago

No, because I specifically do not want it encoded. It must be sent to the CMS precisely as the user entered it. You may argue that I can decode it again afterwards, but the block may contain content that then gets decoded by mistake. For example, the block may contain the text &lt;, which will then be altered erroneously during the subsequence decoding process.

tiesont commented 6 months ago

So, you're after something that could be referenced in private void DoSanitize(IHtmlDocument dom, IParentNode context, string baseUrl = "")?

So this:

// always encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
    .Where(t => t is not IHtmlStyleElement
        && t.Flags.HasFlag(NodeFlags.LiteralText)
        && !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
    var escapedHtml = tag.InnerHtml.Replace("<", "&lt;").Replace(">", "&gt;");
    if (escapedHtml != tag.InnerHtml)
        tag.InnerHtml = escapedHtml;
    if (tag.InnerHtml != escapedHtml) // setting InnerHtml does not work for noscript
        tag.SetInnerText(escapedHtml);
}

would change to something like:

if(EncodeRawDataContext)
{
    // Encode text in raw data content
    foreach (var tag in context.QuerySelectorAll("*")
        .Where(t => t is not IHtmlStyleElement
            && t.Flags.HasFlag(NodeFlags.LiteralText)
            && !string.IsNullOrWhiteSpace(t.InnerHtml)))
    {
        var escapedHtml = tag.InnerHtml.Replace("<", "&lt;").Replace(">", "&gt;");
        if (escapedHtml != tag.InnerHtml)
            tag.InnerHtml = escapedHtml;
        if (tag.InnerHtml != escapedHtml) // setting InnerHtml does not work for noscript
            tag.SetInnerText(escapedHtml);
    }
}

assuming EncodeRawDataContext is a new option which defaults to true according to the current behavior.

jerriep commented 6 months ago

Yep, something like that.

mganss commented 6 months ago

I wouldn't want an option that explicitly bypasses the encoding but perhaps we can make the encoding configurable, allowing the user to set an encoding func with the default being what is there now. For your use case you'd set it to a no-op. Perhaps this would enable other use cases as well.

It'd be great if you could provide a PR @jerriep 🙏🏻

jerriep commented 6 months ago

I appreciate your willingness to accept such a PR. A few questions before I get started:

  1. Should the func be for the entire encoding loop (i.e. https://github.com/mganss/HtmlSanitizer/blob/3443accc4354e871bf856ee8c5f6140f2f02e872/src/HtmlSanitizer/HtmlSanitizer.cs#L477-L487) or just the encoding of the individual elements (i.e. https://github.com/mganss/HtmlSanitizer/blob/3443accc4354e871bf856ee8c5f6140f2f02e872/src/HtmlSanitizer/HtmlSanitizer.cs#L482-L486). My suggestion would be for the latter.
  2. Do you have a preference/suggestion for the naming convention? Perhaps EncodeLiteralTextElementContentTags if we go with the 2nd option in (1)?
tiesont commented 6 months ago

@jerriep Just my two cents, but I think your suggestion for EncodeLiteralTextElementContentTags makes sense, or maybe EncodeLiteralTextElementContent if you're just replacing the inner loop's code (which makes sense to me). So

// always encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
    .Where(t => t is not IHtmlStyleElement
        && t.Flags.HasFlag(NodeFlags.LiteralText)
        && !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
    var escapedHtml = tag.InnerHtml.Replace("<", "&lt;").Replace(">", "&gt;");
    if (escapedHtml != tag.InnerHtml)
        tag.InnerHtml = escapedHtml;
    if (tag.InnerHtml != escapedHtml) // setting InnerHtml does not work for noscript
        tag.SetInnerText(escapedHtml);
}

becomes

// always encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
    .Where(t => t is not IHtmlStyleElement
        && t.Flags.HasFlag(NodeFlags.LiteralText)
        && !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
    EncodeLiteralTextElementContent(tag);
}

which, like @mganss mentioned, could be a no-op.

mganss commented 6 months ago

Thanks @jerriep for your effort. I prefer option 2 as in @tiesont's example above and EncodeLiteralTextElementContent.