Closed jerriep closed 6 months ago
Curious - how does the browser treat that script tag? The default type
for <script>
is text/javascript
(according to the MDN), so it should be throwing up errors once it encounters markup (since it wouldn't be valid JavaScript).
It is not sent to the browser like that. The HTML document is an intermediate format for processing. It ultimately gets sent to a Content Management System (CMS), which will process it and turn it into the correct HTML.
It is also not sent to the CMS inside a script block. It put it temporarily in the script block because that is the only HTML element I found where HtmlSanitizer or AngleSharp does not alter the contents (until the change in the commit I referenced). If I were to place it in a div, for example, the contents may be rejected because, as you pointed out, this may not be valid HTML. It may not even be something remotely like HTML - it may be JSON, as per the last video linked below.
This is also why I am not concerned about someone injecting malicious code because (a) it will be processed subsequently by my app and (b) it will be sanitized correctly by the CMS it is passed to.
For background, it is related to the feature described in the videos below:
I assumed it was something like that.
Would something like this example work for you? https://github.com/mganss/HtmlSanitizer/wiki/Examples#ex4-encode-non-html-before-sanitizing
The only other existing option I can think of would be to handle the RemovingTag event: https://github.com/mganss/HtmlSanitizer/wiki/Hooks
No, because I specifically do not want it encoded. It must be sent to the CMS precisely as the user entered it. You may argue that I can decode it again afterwards, but the block may contain content that then gets decoded by mistake. For example, the block may contain the text <
, which will then be altered erroneously during the subsequence decoding process.
So, you're after something that could be referenced in private void DoSanitize(IHtmlDocument dom, IParentNode context, string baseUrl = "")
?
So this:
// always encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
.Where(t => t is not IHtmlStyleElement
&& t.Flags.HasFlag(NodeFlags.LiteralText)
&& !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
var escapedHtml = tag.InnerHtml.Replace("<", "<").Replace(">", ">");
if (escapedHtml != tag.InnerHtml)
tag.InnerHtml = escapedHtml;
if (tag.InnerHtml != escapedHtml) // setting InnerHtml does not work for noscript
tag.SetInnerText(escapedHtml);
}
would change to something like:
if(EncodeRawDataContext)
{
// Encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
.Where(t => t is not IHtmlStyleElement
&& t.Flags.HasFlag(NodeFlags.LiteralText)
&& !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
var escapedHtml = tag.InnerHtml.Replace("<", "<").Replace(">", ">");
if (escapedHtml != tag.InnerHtml)
tag.InnerHtml = escapedHtml;
if (tag.InnerHtml != escapedHtml) // setting InnerHtml does not work for noscript
tag.SetInnerText(escapedHtml);
}
}
assuming EncodeRawDataContext
is a new option which defaults to true
according to the current behavior.
Yep, something like that.
I wouldn't want an option that explicitly bypasses the encoding but perhaps we can make the encoding configurable, allowing the user to set an encoding func with the default being what is there now. For your use case you'd set it to a no-op. Perhaps this would enable other use cases as well.
It'd be great if you could provide a PR @jerriep 🙏🏻
I appreciate your willingness to accept such a PR. A few questions before I get started:
EncodeLiteralTextElementContentTags
if we go with the 2nd option in (1)?@jerriep Just my two cents, but I think your suggestion for EncodeLiteralTextElementContentTags
makes sense, or maybe EncodeLiteralTextElementContent
if you're just replacing the inner loop's code (which makes sense to me). So
// always encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
.Where(t => t is not IHtmlStyleElement
&& t.Flags.HasFlag(NodeFlags.LiteralText)
&& !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
var escapedHtml = tag.InnerHtml.Replace("<", "<").Replace(">", ">");
if (escapedHtml != tag.InnerHtml)
tag.InnerHtml = escapedHtml;
if (tag.InnerHtml != escapedHtml) // setting InnerHtml does not work for noscript
tag.SetInnerText(escapedHtml);
}
becomes
// always encode text in raw data content
foreach (var tag in context.QuerySelectorAll("*")
.Where(t => t is not IHtmlStyleElement
&& t.Flags.HasFlag(NodeFlags.LiteralText)
&& !string.IsNullOrWhiteSpace(t.InnerHtml)))
{
EncodeLiteralTextElementContent(tag);
}
which, like @mganss mentioned, could be a no-op.
Thanks @jerriep for your effort. I prefer option 2 as in @tiesont's example above and EncodeLiteralTextElementContent
.
I want to enquire whether it would be possible to add a configuration option to bypass the encoding that was done as part of https://github.com/mganss/HtmlSanitizer/commit/ab29319866c020f0cc11e6b92228cd8039196c6e.
I have a use case where it is valid for users to pass custom content that I do not want to be encoded. In my case, I produce an HTML document like the one below:
The markup inside the
script
tag is user-generated and must not be encoded as it must be passed precisely as-is to an external system. With the changes made in the commit linked above, the custom tags inside thatscript
element are encoded, breaking things down the line for me.Would it be possible to add a configuration option to bypass the encoding?
I am happy to submit a PR for this.