rust-ammonia / ammonia

Repair and secure untrusted HTML
Apache License 2.0
524 stars 43 forks source link

Incorrect html entity detection #187

Closed mjanda closed 1 year ago

mjanda commented 1 year ago

Hello, it seems like &curren without ending ; is rewritten to ¤?

let foo = "xxx&currenxxx";
let bar = Builder::new().clean(&foo);
println!("{:?}", bar);

Outputs Document(xxx¤xxx)

Same with &pound -> Document(xxx£xxx) and probably many more...

ammonia 3.3.0

notriddle commented 1 year ago

When I run the following series of commands:

$ cat > test.html
let foo = "xxx&currenxxx";
let bar = Builder::new().clean(&foo);
println!("{:?}", bar);
$ firefox ./test.html

I get this screenshot, showing that Firefox agrees with ammonia's parsing.

image

Ammonia is supposed to parse its input the same way a Web browser would. If you want to escape special characters, instead of escaping tags, you need to use the clean_text function instead of clean.

mjanda commented 1 year ago

Ouch. Thank you for explanation. That's unfortunate. I provided simplest test case I could, but my problems actually occured when user pasted link:

https://api.paylibo.com/paylibo/generator/czech/image?compress=false&size=440&accountPrefix=123&accountNumber=3328001&bankCode=0710&amount=50&currency=CZK&message=Cukimu+na+pono%C5%BEky

Running it through ammonia makes it malformed :-/

From what I've been able to find, optional semicolon is supported only for legacy reasons and current standard makes them mandatory https://html.spec.whatwg.org/multipage/syntax.html#character-references

Would it be possible to support this more strict parsing? Or is it inherited behavior from html library and ammonia can do nothing about it?

notriddle commented 1 year ago

I don't think the HTML library is flexible enough to turn individual fixups like this off

I can make it produce validation errors. Would that work?

Alternatively, you could use regexes to run your own tweaks to the HTML before feeding it to Ammonia. The important part is that any bespoke transformations should be done before feeding it into Ammonia; that way, if there's a bug in your transformation code, it won't potentially cause an XSS vulnerability.

Here's a Playground with what I'd do if I wanted to accept user-supplied HTML from a form or something:

https://play.rust-lang.org/?version=stable&mode=debug&edition=2021&gist=bfae96f77d5fa0f4b63cd4ae73581344

mjanda commented 1 year ago

I will explore how difficult it would be to first apply my transformations before ammonia. Now, I'm doing it after.

Thank you for explanation. I'll close this issue as it clearly isn't bug and there is no way to easily solve it here.