mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/
Other
2.66k stars 250 forks source link

Change elements to tags (#649) #693

Closed willkg closed 1 year ago

willkg commented 1 year ago

This cleans up some API weirdness where BleachSanitizerFilter subclasses html5lib's sanitizer filter and sets allowed_elements to something entirely different than html5lib's sanitizer filter's allowed_elements is. The formers is a list of strings. The latter is a frozenset of (namespace, element) tuples.

This fixes that by changing the BleachSanitizerFilter code to use allowed_tags everywhere. This eliminates the API confusion.

While doing this, I changed ALLOWED_TAGS to be a frozenset, changed allowed_tags to take a set of strings rather than a list of strings, and renamed strip_disallowed_elements to strip_disallowed_tags to be consistent with naming. This is a backwards incompatible change we'll need to mention in CHANGES and it means a major version change.

I also changed the __init__ for BleachSanitizerFilter so it no longer calls the html5lib sanitizer filter __init__ which kicks up the deprecation warning.

Fixes #649.

willkg commented 1 year ago

@g-k I think the API weirdness in issue #649 is worth fixing. This was the minimally viable way I could think of fixing it that didn't create a slew of backwards-incompatible changes.

Other options:

  1. Change allowed_tags to allowed_elements (and generally uses of the word "tags" to "elements", too) everywhere and change it to be a set of (namespace, element) tuples. This changes the argument name everywhere, but also, developers would have to add namespaces to their allowed_elements values. Further, this interacts with allowed_attributes in that the keys now need to have elements and callable values would need to support namespaces, too. I remember html5lib adding namespace support, but that it wasn't complete in some way, so this might not even work right. It's a lot of changes for us and a lot of changes for users.
  2. Pull in what little bits of code we're still using from html5lib sanitizer filter into our BleachSanitizerFilter so we're not longer subclassing it and then clean up the API. This is a lot of work for us plus we have to deal with how to pull in the code without violating license/copyright in a way that's maintainable.
  3. Adjust the constructors so that if allowed_tags is a list/set of (namespace, element) tuples, drop the namespaces. This feels like it creates a potential surprise for users and would warrant someone writing up another bug.
  4. Do nothing. I think this is worth fixing, so we could do nothing, but I'll feel bad for the rest of my life.

What do you think? Does this look ok?

willkg commented 1 year ago

Thank you!