rust-ammonia / ammonia

Repair and secure untrusted HTML
Apache License 2.0
524 stars 43 forks source link

Improper handling of boolean attributes? #186

Closed miketheman closed 1 year ago

miketheman commented 1 year ago

Hello! Thanks for the nifty library.

Apologies if this is reported elsewhere, or if I'm using the incorrect language. I've not done much Rust, and came to this library from the Python wrapper nh3, but I think I have an example in Rust that shows the issue.

When passing a value to tag_attributes to allow it through the cleansing cycle, it looks like the code doesn't currently know how to use boolean-type attributes.

One such instance I'm hitting is disabled tag attribute.

Then used in HTML, this often looks like:

  <input name="name" type="text" disabled />

However when passing this to ammonia, the output becomes:

  <input name="name" type="text" disabled="" />

Note the trailing =""

While valid HTML, and executes correctly in the browser I tested, it becomes a more verbose syntax, and unnecessarily adds characters to the output HTML.

Here's an example demonstrating the issue:

use ammonia::Builder;
use maplit::{hashmap, hashset};

fn main() {
    let input_html = "<p><input type=\"checkbox\" disabled></p>";

    let added_tags = hashset!["input"];
    let tag_attributes = hashmap![
        "input" => hashset!["type", "disabled"]
    ];

    let cleaned_html = Builder::default()
        .add_tags(added_tags)
        .tag_attributes(tag_attributes)
        .clean(&input_html)
        .to_string();

    assert_eq!(cleaned_html, input_html);
}

I found this listing of boolean attributes, updated July 2023 - it might be useful.

Alternately, if the source tag attribute has no = value, maybe don't add it?

notriddle commented 1 year ago

Fixing this would require patching html5ever::serialize to special-case boolean attributes. The implementation code is here.

Is this a blocking issue? I'd rather not take on patched copies of html5ever code if it's just there to trim a few bytes off a seldom-used feature.

miketheman commented 1 year ago

Thanks - it's not blocking just yet. We can decide if we want to adapt by adding trailing ="" to whatever boolean attributes we get.

Should this be re-opened against the underlying library instead?

notriddle commented 1 year ago

That might be the best place to do it, though since the html5 serialization algorithm that it implements doesn't call for this, they probably don't want to merge it.

For each attribute that the element has, append a U+0020 SPACE character, the attribute's serialized name as described below, a U+003D EQUALS SIGN character (=), a U+0022 QUOTATION MARK character ("), the attribute's value, escaped as described below in attribute mode, and a second U+0022 QUOTATION MARK character (").

miketheman commented 1 year ago

Hmm, I guess so 🤔 Since trailing empty ="" is indeed valid HTML5, we might simply accept that modification and move along. I'll discuss it with my pals next week.

Thanks for your assistance!

miketheman commented 1 year ago

Thanks for the pointers, I ended up adding the trailing ="" whenever necessary.