Closed facundochaud-eb closed 1 year ago
Is this still a problem with the latest version of Bleach (4.1.0)? If so, can you provide a test case that illustrates what the issue is?
Yes, it's still a problem with the latest version. The pieces of code I shared here have not changed in those versions.
You could reproduce it like this:
>>> from bleach.sanitizer import BleachSanitizerFilter
>>> from html5lib.filters.sanitizer import Filter as HtmlSanitizerFilter
>>> stream = [{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'StartTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': OrderedDict([((None, 'onclick'), "window.location.href='http://www.evil.com'")])}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'EndTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml'}]
>>> [foo for foo in BleachSanitizerFilter(stream)]
[{'data': 'title', 'type': 'Characters'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'data': '<p onclick="window.location.href=\'http://www.evil.com\'">Evil</p>', 'type': 'Characters'}]
>>> [foo for foo in HtmlSanitizerFilter(stream)]
[{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '<p onclick="window.location.href=\'http://www.evil.com\'">'}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '</p>'}]
Our previous usage was:
from html5lib import serializer
from html5lib.filters import sanitizer
class HTMLSerializer(serializer.HTMLSerializer):
# Without this, the html5lib serializer will drop certain ending tags if
# they are deemed optional: <p>hi</p> -> <p>hi
omit_optional_tags = False
def _render_stream(stream, sanitize=False):
# Render a parsed and possibly filtered stream into a string
s = HTMLSerializer()
if sanitize:
stream = sanitizer.Filter(stream)
return s.render(stream)
That ended up returning title <p>Evil</p>
, but if I replace the sanitizer filter by bleach's one I get title <p onclick="window.location.href='http://www.evil.com'">Evil</p>
The BleachSanitizerFilter needs to work on a stream built by BleachHTMLTokenizer. What happens when you use that to parse whatever it is you're parsing?
The stream I was using was created from an html5lib.HTMLParser
. I replaced it with a BleachHTMLParser
which uses BleachHTMLTokenizer
. Got the same results
You mention alphabetized attributes. We landed a fix for #566. I think that fixes that issue for you, too.
Beyond that, this is a pretty detailed issue report and it's hard for me to wrap my head around what you're trying to report. I skimmed it, but that's clearly not sufficient. I won't have time to really pour over this any time soon.
Hmm main TLDR thing of this issue would be that self.allowed_elements
is incompatible between bleach and html5lib, and I propose 3 solutions for that. I can open a PR for the last proposal if you like, but I don't know if that would be the go-to option for you
There's no hurry in fixing this, we can just keep using old versions, I just wanted to get rid of the deprecation warnings 😅
Describe the bug
I was trying to replace
html5lib.filters.sanitizer.Filter
withbleach.sanitizer.BleachSanitizerFilter
(which inherits from the earlier one). We used that filter by just doingFilter(stream)
. And we find that we're getting all the tags alphabetized, which we don't really want.I found that to know if something needs to be alphabetized, there's this check: https://github.com/mozilla/bleach/blob/e4718bdda1c43849a8dfbda1b2915e8a6594aab2/bleach/sanitizer.py#L354
That is only set in the
html5lib.filters.sanitizer.Filter
__init__
function by sending anallowed_elements
. It also has some default ones. But the default value for that is a set of tuples(namespace, token name)
, not some iterable of token names: https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L31-L32And that is used like this: https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L808
I see that
allowed_elements
is overriden in theCleaner.clean(text)
method, and a list of tags is provided. https://github.com/mozilla/bleach/blob/e4718bdda1c43849a8dfbda1b2915e8a6594aab2/bleach/sanitizer.py#L179So that is consistent with the usage of the
BleachSanitizerFilter
sanitize_token
method, but that is not a consistent format with the parent.python and bleach versions (please complete the following information):
Expected behavior
I propose one of these 3 options:
BleachSanitizerFilter
should useallowed_elements
that are compatible withhtml5lib
ones, which would require a lot of refactoring and it's likely to not be backwards compatibleBleachSanitizerFilter
should force clients to send a new compatible set ofallowed_elements
by adding it as a required argument, as it's not really optional to not send it because it's not compatible with the usageBleachSanitizerFilter
should accept both formats ofallowed_elements
by doing something like this: