mozilla / bleach

Bleach is an allowed-list-based HTML sanitizing library that escapes or strips markup and attributes
https://bleach.readthedocs.io/en/latest/

wbr element shouldn't be balanced #488

Open jcushman opened 5 years ago

jcushman commented 5 years ago

The <wbr> element is balanced by bleach.clean even though it is an empty element.

Using the list of empty tags from MDN:

In [6]: empty_elements = {
   ...:     'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input', 'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr'
   ...: }

In [7]: html = "".join("<%s>" % s for s in empty_elements)

In [8]: import bleach

In [9]: bleach.clean(html, tags=empty_elements)
Out[9]: '<param><source><hr><base><track><area><wbr></wbr><br><img><keygen></keygen><link><input><meta><embed>'

The output includes <wbr></wbr> when it should just be <wbr> like the others. keygen has the same problem, but that's deprecated so I'm not sure if it's worth including.
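For anyone who wants to check sanitized output for this symptom without eyeballing it, here is a minimal sketch using only the standard library's `html.parser`. It does not call bleach; it just scans a string for explicit end tags on elements from the MDN list above. The names `find_closed_void_elements` and `_EndTagCollector` are made up for this illustration.

```python
from html.parser import HTMLParser

# Void-element list copied from the report above (it includes the deprecated keygen).
VOID_ELEMENTS = frozenset({
    'area', 'base', 'br', 'col', 'embed', 'hr', 'img', 'input',
    'keygen', 'link', 'meta', 'param', 'source', 'track', 'wbr',
})


class _EndTagCollector(HTMLParser):
    """Hypothetical helper: records void elements that have an explicit end tag."""

    def __init__(self):
        super().__init__()
        self.closed_void = []

    def handle_startendtag(self, tag, attrs):
        # "<br/>" style tags would otherwise trigger handle_endtag via the
        # default implementation; they are not the problem being checked here.
        pass

    def handle_endtag(self, tag):
        # Called for explicit end tags such as "</wbr>".
        if tag in VOID_ELEMENTS:
            self.closed_void.append(tag)


def find_closed_void_elements(html):
    """Return the void element names that appear with an end tag, in order."""
    parser = _EndTagCollector()
    parser.feed(html)
    parser.close()
    return parser.closed_void
```

Run against the string in `Out[9]`, this should report `['wbr', 'keygen']`, which matches the two elements called out above.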

g-k commented 4 years ago

Hmm, yeah, I can reproduce this. wbr is listed as a self-closing tag at:

https://github.com/mozilla/bleach/blob/a06cd773694721f7cace21d09958afdf301f338d/bleach/_vendor/html5lib/html5parser.py#L964-L965

and should have:

token["selfClosingAcknowledged"] = True

but I get

{'type': 'StartTag', 'name': 'wbr', 'namespace': None, 'data': OrderedDict()}
{'type': 'EndTag', 'name': 'wbr', 'namespace': None}

at https://github.com/mozilla/bleach/blob/master/bleach/sanitizer.py#L271 so I'm thinking one of these things might be going on:

but I'll need to find more time to look into it further.

g-k commented 4 years ago

OK this is a bug in html5lib (v1.1 at least):

» python
Python 3.8.2 (default, Mar 26 2020, 12:39:19)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach._vendor.html5lib as html5lib
>>> html5lib.__version__
'1.1'
>>> html5lib.serialize(html5lib.parseFragment('<area>')) # this is correct
'<area>'
>>> html5lib.serialize(html5lib.parseFragment('<wbr>')) # should be <wbr>
'<wbr></wbr>'
>>> html5lib.serialize(html5lib.parseFragment('<keygen>')) # HTML 5.2 deprecates the tag
'<keygen></keygen>'
>>> html5lib.serialize(html5lib.parseFragment('<menuitem>')) # https://github.com/html5lib/html5lib-python/issues/203 mentions this but https://developer.mozilla.org/en-US/docs/Web/HTML/Element/menuitem shows non-void examples and says HTML 5.2 deprecates it
'<menuitem></menuitem>'

The upstream issue is https://github.com/html5lib/html5lib-python/issues/203, and the upstream PR for wbr is https://github.com/html5lib/html5lib-python/pull/395.

Not sure what html5lib's position on deprecated elements is.
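To illustrate the kind of behavior seen in the session above, here is a toy model of a serializer that consults a void-element set before emitting an end tag. This is not html5lib's code and makes no claim about where the check actually lives there; `serialize`, `VOID_WITHOUT_WBR`, and `VOID_WITH_WBR` are hypothetical names, and the sets are deliberately tiny.

```python
# Toy model only: NOT html5lib's implementation. It shows how an element that
# is missing from a void-element set ends up with a spurious end tag.

def serialize(names, void_elements):
    """Serialize a flat sequence of empty elements given by name."""
    parts = []
    for name in names:
        parts.append("<%s>" % name)
        # Void elements never get an end tag; everything else is balanced.
        if name not in void_elements:
            parts.append("</%s>" % name)
    return "".join(parts)


# Hypothetical, deliberately small sets for the illustration.
VOID_WITHOUT_WBR = frozenset({"area", "br", "img"})
VOID_WITH_WBR = VOID_WITHOUT_WBR | {"wbr"}

before = serialize(["area", "wbr"], VOID_WITHOUT_WBR)  # wbr gets balanced
after = serialize(["area", "wbr"], VOID_WITH_WBR)      # wbr treated as void
```

In this model, `before` is `'<area><wbr></wbr>'` and `after` is `'<area><wbr>'`, mirroring the difference between the `<area>` and `<wbr>` results in the session.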

ambv commented 1 year ago

This is now addressed in html5lib: https://github.com/html5lib/html5lib-python/pull/395

willkg commented 1 month ago

Waiting on an html5lib release with this fix. Then we can update the vendored html5lib and test everything.