mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/
Other
2.65k stars 253 forks source link

bug: allowed_elements in BleachSanitizerFilter is different from html5lib_shim.SanitizerFilter #649

Closed facundochaud-eb closed 1 year ago

facundochaud-eb commented 2 years ago

Describe the bug

I was trying to replace html5lib.filters.sanitizer.Filter with bleach.sanitizer.BleachSanitizerFilter (which inherits from the earlier one). We used that filter by just doing Filter(stream). And we find that we're getting all the tags alphabetized, which we don't really want.

I found that to know if something needs to be alphabetized, there's this check: https://github.com/mozilla/bleach/blob/e4718bdda1c43849a8dfbda1b2915e8a6594aab2/bleach/sanitizer.py#L354

That is only set in the html5lib.filters.sanitizer.Filter __init__ function by sending an allowed_elements. It also has some default ones. But the default value for that is a set of tuples (namespace, token name), not some iterable of token names: https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L31-L32

And that is used like this: https://github.com/html5lib/html5lib-python/blob/f87487a4ada2d6cf223bdd182774a01ba3c84618/html5lib/filters/sanitizer.py#L808

            if ((namespace, name) in self.allowed_elements or
                (namespace is None and
                 (namespaces["html"], name) in self.allowed_elements)):

I see that allowed_elements is overriden in the Cleaner.clean(text) method, and a list of tags is provided. https://github.com/mozilla/bleach/blob/e4718bdda1c43849a8dfbda1b2915e8a6594aab2/bleach/sanitizer.py#L179

So that is consistent with the usage of the BleachSanitizerFilter sanitize_token method, but that is not a consistent format with the parent.

python and bleach versions (please complete the following information):

Expected behavior

I propose one of these 3 options:

willkg commented 2 years ago

Is this still a problem with the latest version of Bleach (4.1.0)? If so, can you provide a test case that illustrates what the issue is?

facundochaud-eb commented 2 years ago

Yes, it's still a problem with the latest version. The pieces of code I shared here have not changed in those versions.

You could reproduce it like this:

>>> from bleach.sanitizer import BleachSanitizerFilter
>>> from html5lib.filters.sanitizer import Filter as HtmlSanitizerFilter
>>> stream = [{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'StartTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': OrderedDict([((None, 'onclick'), "window.location.href='http://www.evil.com'")])}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'EndTag', 'name': 'p', 'namespace': 'http://www.w3.org/1999/xhtml'}]
>>> [foo for foo in BleachSanitizerFilter(stream)]
[{'data': 'title', 'type': 'Characters'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'data': '<p onclick="window.location.href=\'http://www.evil.com\'">Evil</p>', 'type': 'Characters'}]
>>> [foo for foo in HtmlSanitizerFilter(stream)]
[{'type': 'Characters', 'data': 'title'}, {'data': ' ', 'type': 'SpaceCharacters'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '<p onclick="window.location.href=\'http://www.evil.com\'">'}, {'type': 'Characters', 'data': 'Evil'}, {'type': 'Characters', 'namespace': 'http://www.w3.org/1999/xhtml', 'data': '</p>'}]

Our previous usage was:

from html5lib import serializer
from html5lib.filters import sanitizer

class HTMLSerializer(serializer.HTMLSerializer):
    # Without this, the html5lib serializer will drop certain ending tags if
    # they are deemed optional: <p>hi</p> -> <p>hi
    omit_optional_tags = False

def _render_stream(stream, sanitize=False):
    # Render a parsed and possibly filtered stream into a string
    s = HTMLSerializer()
    if sanitize:
        stream = sanitizer.Filter(stream)
    return s.render(stream)

That ended up returning title <p>Evil</p>, but if I replace the sanitizer filter by bleach's one I get title &lt;p onclick="window.location.href='http://www.evil.com'"&gt;Evil&lt;/p&gt;

willkg commented 2 years ago

The BleachSanitizerFilter needs to work on a stream built by BleachHTMLTokenizer. What happens when you use that to parse whatever it is you're parsing?

facundochaud-eb commented 2 years ago

The stream I was using was created from an html5lib.HTMLParser. I replaced it with a BleachHTMLParser which uses BleachHTMLTokenizer. Got the same results

willkg commented 2 years ago

You mention alphabetized attributes. We landed a fix for #566. I think that fixes that issue for you, too.

Beyond that, this is a pretty detailed issue report and it's hard for me to wrap my head around what you're trying to report. I skimmed it, but that's clearly not sufficient. I won't have time to really pour over this any time soon.

facundochaud-eb commented 2 years ago

Hmm main TLDR thing of this issue would be that self.allowed_elements is incompatible between bleach and html5lib, and I propose 3 solutions for that. I can open a PR for the last proposal if you like, but I don't know if that would be the go-to option for you

facundochaud-eb commented 2 years ago

There's no hurry in fixing this, we can just keep using old versions, I just wanted to get rid of the deprecation warnings 😅