mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License
1.51k stars, 198 forks

Url extra escaping #533

Open andrewQwer opened 4 months ago

andrewQwer commented 4 months ago

Hi, I'm using HtmlSanitizer to sanitize markup, and after updating the library from 5.x to 8.x the & sign in URLs gets escaped. The problem is that I can't find where this happens.

I have the following code:

var gj = new HtmlSanitizer
{
    OutputFormatter = HtmlMarkupFormatter.Instance,
    AllowDataAttributes = true
};
gj.Sanitize("<img src='http://foobar.com?x=5&y-6'>");

The output is: <img src="http://foobar.com?x=5&amp;y-6"> (note that &amp; has appeared).

I tried to do the following:

gj.FilterUrl += (object o, FilterUrlEventArgs e) => {
    Console.WriteLine(e.OriginalUrl); // shows <img src='http://foobar.com?x=5&y-6'>
    Console.WriteLine(e.SanitizedUrl); // shows <img src='http://foobar.com?x=5&y-6'>
};

In this event both values are the same and still unescaped, so there is no chance of fixing it at this stage.

OK, next I tried the following:

gj.PostProcessDom += (sender, args) =>
{
    var doc = args.Document;
    var imgNodes = doc.QuerySelectorAll("img");
    foreach (var imgNode in imgNodes)
    {
        Console.WriteLine("SRC in DOC:" + imgNode.GetAttribute("src")); // shows SRC in DOC: http://foobar.com?x=5&y-6
    }
};

So even in the PostProcessDom event the attribute value is not escaped yet. The same is true for the PostProcessNode event.

What else can I do to get URLs in src/href attributes back to their original unescaped values?

tiesont commented 4 months ago

Possibly relevant, although not a fix: #401

andrewQwer commented 4 months ago

> Possibly relevant, although not a fix: #401

Yes, indeed. I'm OK with it being escaped by default, but I would like to have a way to undo that somehow, at least in the events. The FilterUrl event seems like the most logical place, but at the moment the event fires the URL is still unescaped.

I also found this issue:

https://github.com/AngleSharp/AngleSharp/issues/348

andrewQwer commented 4 months ago

For now I wrote the following fix using a custom OutputFormatter: https://dotnetfiddle.net/wbtvUI. Please let me know if there is another way to catch the escaped values in events.
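
As a point of reference for the escaping discussed in this thread, here is a minimal sketch in plain .NET, with no HtmlSanitizer or AngleSharp involved. It shows that the &amp; form seen in the serialized src attribute decodes back to the original URL. EscapingDemo and DecodeAttributeValue are hypothetical names introduced only for illustration, and WebUtility.HtmlDecode is used here as a stand-in for what an HTML parser does when it reads an attribute value. It is not what the library does internally.

```csharp
using System;
using System.Net;

public static class EscapingDemo
{
    // Hypothetical helper: resolves character references such as &amp;
    // in a serialized attribute value, roughly as an HTML parser would.
    public static string DecodeAttributeValue(string serialized)
    {
        return WebUtility.HtmlDecode(serialized);
    }

    public static void Main()
    {
        // The attribute value exactly as it appears in the sanitizer's output.
        var serialized = "http://foobar.com?x=5&amp;y-6";

        // Prints: http://foobar.com?x=5&y-6
        Console.WriteLine(DecodeAttributeValue(serialized));
    }
}
```

If the sanitized markup is later parsed as HTML, the consumer should therefore see the original URL. The escaped form only matters when the output string is compared or processed as raw text.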