mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/
Other
2.65k stars 252 forks source link

bug: pr #394 regressed AMP html / custom elements #560

Closed dgilman closed 1 year ago

dgilman commented 4 years ago

Describe the bug

The fix committed in #394 breaks custom HTML5 elements, notably Google's AMP. I don't think your whitelisting strategy is valid in the face of HTML 5 as custom components are gaining more popularity and are part of the standard.

python and bleach versions (please complete the following information):

Bleach 3 - regression is commit bea164a3a7e6ce2dfb826130e8faa6f447680dc1

To Reproduce

Steps to reproduce the behavior:

import bleach
foo = bleach.linkify('<amp-img layout="responsive" src="/static/img/local/coords_2.png" height="289" width="481"></amp-img>', callbacks=[], skip_tags=["amp-img"])
assert '&lt;' not in foo

Expected behavior

No assertion

Additional context

If you have feedback on how this can be fixed I will do a PR, although not this weekend. If it is possible to just hack around this by shoving our own valid list of HTML elements in somewhere I can do that too, and maybe do a documentation PR?

dgilman commented 4 years ago

As a workaround, I can confirm that instantiating your own Cleaner instance, passing the LinkifyFilter to that, and whitelisting the relevant AMP tags with that cleaner gets you around the issue.

g-k commented 3 years ago

Cool, I'm glad you found a workaround and appreciate you digging up the regression!

This comment from the PR that introduced the regression looks relevant:

This builds a list of all known HTML tags that html5lib knows about.

I thought about making this an argument, but that requires some new naming and tests and documenting work. I currently think if a user needs to specify tags, then they should be sanitizing and then linkifying.

https://github.com/mozilla/bleach/pull/394#discussion_r221985246

If you have time to work on a PR that:

I think that'd work.

willkg commented 1 year ago

We're not going to fix this and there's a valid way to get the desired behavior.