rust-ammonia / ammonia

Repair and secure untrusted HTML
Apache License 2.0
523 stars 43 forks source link

How to avoid replacing special characters like `&` ? #184

Closed Nutomic closed 1 year ago

Nutomic commented 1 year ago

The clean function in ammonia 3.3 unexpectedly replaces HTML special characters like & with &amp;, same for <, > and so on. Based on the docs and code, this should only happen with the clean_text function which Im not invoking at all. In fact I cant tell where these character replacements are happening or how to disable them. Any hints?

Edit: To reproduce, the following assertion fails but I would expect it to pass:

let sanitized = ammonia::clean("&!<>?");
assert_eq!(sanitized, "&!<>?");
notriddle commented 1 year ago

This doesn't seem like a bug.

Both clean functions are supposed to accept HTML, not unformatted text, and they produce HTML as their output. You shouldn't be sending markdown through them.

For your usecase, I recommend sending your text through the markdown processor first, then sending its output HTML through ammonia.

Nutomic commented 1 year ago

That would only work if ammonia respects <pre> tags and doesnt change anything inside them. Is that the case?

notriddle commented 1 year ago

Why would <pre> tags not change anything inside them? I don't see anything special about them in the html parser [^except] or serialization rules for them. That's why, if you write this HTML:

[^except]: <pre> tags aren't allowed to be nested within paragraph tags, so they have a special parser rule for closing any unclosed <p> tags. The important part is that the tokenizer mode isn't changed the same way, for example, <script> does.

<pre><a href="https://rust-lang.org">Rust</a></pre>

And you get this:

Rust
Nutomic commented 1 year ago

That might be true in HTML, but when you write a markdown code block like the following, you expect it to be rendered unchanged, which Github indeed does.

<script>alert("xss example");</script>

Or

docker-compose pull && docker-compose up -d

We also have a different, but related problem because we need to escape the same string twice in some cases. This is necessary because each time it is untrusted in a different context. Which results in & turning to &amp;amp;.

All in all its really complicated to find a proper solution.

lnicola commented 1 year ago

Can you wrap the code blocks in <![CDATA[]]>?

notriddle commented 1 year ago

If I write a literal <script> tag in GFM, it gets filtered out before being shown in the web page (of course), but it's actually still there in the database, and I can see it in the API.

See the comment thread here: https://api.github.com/repos/rust-ammonia/ammonia/issues/comments/1665839239 In other words, their API deals in raw, unmodified user data with no escaping, and it eventually gets escaped *exactly once*. No double-escaping. (probably; it's proprietary software, so I can't actually check, but that's what it looks like it does)

Nutomic commented 1 year ago

Im closing this issue because ammonia isnt the right tool for our use case. Instead we are going to manually replace special html characters with escape codes.