mganss / HtmlSanitizer

Cleans HTML to avoid XSS attacks
MIT License

<tbody> elements are added to tables #432

Closed TopCoder02 closed 1 year ago

TopCoder02 commented 1 year ago

Not sure why it inserts <tbody> when it sanitizes, when it wasn't present before. What am I missing?

tiesont commented 1 year ago

Can you include an example of your content and how you're sanitizing it, please?

Most likely, AngleSharp (the library which HtmlSanitizer uses to parse markup) is doing this because that's what browsers do - if you use a browser like Firefox, the dev tools will show you the actual markup which your browser is displaying, rather than what was given to it. For a table, that means a <tbody> is injected for any <tr>s present in the root <table> element (or the rows are merged into the existing tbody - it's been a while since I've had to troubleshoot table issues, and it probably varies by vendor anyway).

TopCoder02 commented 1 year ago

This <table><tr><td>help</td></tr></table> is converted to <table><tbody><tr><td>help</td></tr></tbody></table>

In the command window

>? sanitizer.Sanitize("<table><tr><td>help</td></tr></table>")
"<table><tbody><tr><td>help</td></tr></tbody></table>"

To me the tbody is not necessary; it just takes up extra bandwidth. It doesn't cause any harm other than the bandwidth.

tiesont commented 1 year ago

The gist here is that if you think the behavior is wrong, it's unfortunately an upstream issue - AngleSharp follows browser behavior. Might be worth checking that project's issue board to see if anyone has a similar complaint and/or a fix.

mganss commented 1 year ago

I second what @tiesont said. FWIW here's the same example in the browser console:

let d = document.createElement("div");
d.innerHTML = '<table><tr><td>help</td></tr></table>';
d.innerHTML
-> '<table><tbody><tr><td>help</td></tr></tbody></table>'

TopCoder02 commented 1 year ago

Thanks for taking the time to look into this. I'm just removing it after the sanitize process. I'm trying to write a tool to help people format HTML properly for email campaigns. Part of that is HTML size, because a lot of companies charge for egress bandwidth. If you have a big audience, those extra bytes tend to add up.
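
For anyone wanting to do the same, removing the tags after sanitizing could look roughly like the sketch below. It is written in JavaScript to match the console snippet above. The helper name `stripBareTbody` is hypothetical and is not part of HtmlSanitizer or AngleSharp. It assumes the input is already-sanitized output, where tags are serialized in a normalized form and "<" in text is escaped, so a plain string replacement is workable; it is not a general HTML transform. Any client that parses the markup will add the <tbody> back on its side, so this only changes the bytes sent.

```javascript
// Hypothetical post-processing helper; not part of HtmlSanitizer or AngleSharp.
// Assumption: `html` is sanitizer output, so tags are normalized and "<" in
// text content is escaped as &lt;.
function stripBareTbody(html) {
  // If any <tbody> carries attributes, leave the markup untouched rather than
  // risk removing a closing tag that belongs to it.
  if (/<tbody\s/i.test(html)) {
    return html;
  }
  // Remove attribute-free opening and closing tbody tags.
  return html.replace(/<\/?tbody>/gi, "");
}

const sanitized = "<table><tbody><tr><td>help</td></tr></tbody></table>";
const stripped = stripBareTbody(sanitized);

console.log(stripped);
// prints: <table><tr><td>help</td></tr></table>

console.log(sanitized.length - stripped.length);
// prints: 15 (7 characters for <tbody> plus 8 for </tbody>, per table)
```

The saving is 15 bytes per table in ASCII markup, so whether it matters depends on how many tables each message contains and how many messages go out.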