rust-ammonia / ammonia

Repair and secure untrusted HTML
Apache License 2.0

Re-running ammonia on its own output gives different results #185

Open bcaller opened 1 year ago

bcaller commented 1 year ago

There are a few cases I've found where feeding the output of ammonia back into ammonia gives a different output.

I'm not sure whether this means the initial output is non-compliant or potentially unsafe, or whether you'd consider this not a bug.

It's also possible the bug is entirely within html5ever; I'm not sure.

In the first two examples, entity decoding produces characters that the second pass then removes or changes.

The later examples show the sanitizer wanting to move closing tags around. In each example, the lines are the input, the output of the first pass, and the output of the second pass.

Anyway, do you think it's worth running ammonia twice, or is it nothing to worry about?
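For what it's worth, "running it twice" can be generalized to "running it until the output stops changing". Below is a minimal sketch of such a check. The names `Stability` and `sanitize_to_fixpoint` are hypothetical and not part of ammonia; the sanitizer is passed in as a closure so the helper itself depends only on the standard library.

```rust
/// Outcome of re-running a sanitizer on its own output.
/// (Hypothetical type for illustration; not part of ammonia.)
#[derive(Debug, PartialEq)]
pub enum Stability {
    /// The output stopped changing; `passes` is how many sanitizer
    /// calls it took to reach `output`.
    Stable { passes: usize, output: String },
    /// The output was still changing when the re-check limit was reached.
    Unstable { last: String },
}

/// Sanitizes `input` once, then re-sanitizes the result up to
/// `max_rechecks` times, stopping as soon as a pass changes nothing.
pub fn sanitize_to_fixpoint<F>(sanitize: F, input: &str, max_rechecks: usize) -> Stability
where
    F: Fn(&str) -> String,
{
    let mut current = sanitize(input);
    for passes in 1..=max_rechecks {
        let next = sanitize(&current);
        if next == current {
            // Feeding the output back in gave the same output: a fixpoint.
            return Stability::Stable { passes, output: current };
        }
        current = next;
    }
    Stability::Unstable { last: current }
}
```

With ammonia, the call would presumably look like `sanitize_to_fixpoint(|s| ammonia::clean(s), html, 4)`, where the limit of 4 is an arbitrary choice. A result of `Stable { passes: 1, .. }` means the first output was already idempotent; anything higher reproduces the behaviour described in this issue.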


HTML entity -> \r -> \n


\r
\n

HTML entity for BOM at start -> BOM at start -> nothing (OK, this one I understand, because we use the default TokenizerOpts with discard_bom)

&#65279!
\ufeff!
!
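My understanding of the two entity cases above is that input preprocessing (dropping a leading BOM, normalizing CR to LF) happens before character references are decoded, so a decoded `\r` or BOM survives the first pass and is only caught when it is fed back in as a raw character. The following is a toy model of that ordering, not html5ever's actual code: `preprocess`, `decode_entities` and `toy_round_trip` are made-up names, and the decoder only recognizes the two references used here, written with a trailing semicolon.

```rust
/// Toy stand-in for input preprocessing: drop one leading BOM
/// (the effect attributed to `discard_bom` above) and normalize
/// CRLF and lone CR to LF.
fn preprocess(input: &str) -> String {
    let without_bom = input.strip_prefix('\u{feff}').unwrap_or(input);
    without_bom.replace("\r\n", "\n").replace('\r', "\n")
}

/// Toy entity decoder that only knows the two numeric references
/// from the examples (assumed here to end with a semicolon).
fn decode_entities(input: &str) -> String {
    input.replace("&#13;", "\r").replace("&#65279;", "\u{feff}")
}

/// One "parse and serialize" round trip in this model: preprocessing
/// runs first, decoding second, and the decoded characters are written
/// back out raw.
pub fn toy_round_trip(input: &str) -> String {
    decode_entities(&preprocess(input))
}
```

In this model `toy_round_trip("&#13;")` gives `"\r"` and a second round trip gives `"\n"`, while `toy_round_trip("&#65279;!")` gives `"\u{feff}!"` and a second round trip gives `"!"`, matching the sequences listed above.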

Anchor tag hopping around:

<a><table><a>
<a><a></a><table></table></a>
<a></a><a></a><table></table>

<h1><a><h6></a></h6>
<h1><a></a><h6><a></a></h6></h1>
<h1><a></a></h1><h6><a></a></h6>

Paragraph tags reproducing:

<p><svg><foreignobject><p>
<p><p></p></p>
<p></p><p></p><p></p>