Open jcushman opened 5 years ago
hmm yeah I can reproduce. wbr is listed as a self closing tag on:
and should have:
token["selfClosingAcknowledged"] = True
but I get
{'type': 'StartTag', 'name': 'wbr', 'namespace': None, 'data': OrderedDict()}
{'type': 'EndTag', 'name': 'wbr', 'namespace': None}
at https://github.com/mozilla/bleach/blob/master/bleach/sanitizer.py#L271 so I'm thinking one of these things might be going on:
- tagOpenState or another method in html5lib_shim.py leaves the parser in a bad state that causes it to not be recognized as a self closing tag
- the tags arg doesn't pass the tag as a self closing tag
but I'll need to find more time to look into it further.
OK this is a bug in html5lib (v1.1 at least):
» python
Python 3.8.2 (default, Mar 26 2020, 12:39:19)
[GCC 7.5.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import bleach._vendor.html5lib as html5lib
>>> html5lib.__version__
'1.1'
>>> html5lib.serialize(html5lib.parseFragment('<area>')) # this is correct
'<area>'
>>> html5lib.serialize(html5lib.parseFragment('<wbr>')) # should be <wbr>
'<wbr></wbr>'
>>> html5lib.serialize(html5lib.parseFragment('<keygen>')) # HTML 5.2 deprecates the tag
'<keygen></keygen>'
>>> html5lib.serialize(html5lib.parseFragment('<menuitem>')) # https://github.com/html5lib/html5lib-python/issues/203 mentions this but https://developer.mozilla.org/en-US/docs/Web/HTML/Element/menuitem shows non-void examples and says HTML 5.2 deprecates it
'<menuitem></menuitem>'
The upstream issue is https://github.com/html5lib/html5lib-python/issues/203 and the upstream PR for wbr is https://github.com/html5lib/html5lib-python/pull/395
Not sure what html5lib's position on deprecated elements is.
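For anyone following along, one plausible way to picture the serialization difference above: a serializer that looks each element name up in a table of void elements will balance any element missing from that table. Here is a toy sketch of that mechanism. The Element class, the serialize function, and both sets are made up for illustration and are not html5lib's code or its actual lists.

```python
from dataclasses import dataclass, field

# Hypothetical void-element tables for illustration only;
# these are NOT copied from html5lib.
VOID_WITHOUT_WBR = frozenset({"area", "br", "hr", "img", "input"})
VOID_WITH_WBR = VOID_WITHOUT_WBR | {"wbr"}


@dataclass
class Element:
    """Toy tree node: a tag name plus child nodes (Elements or strings)."""
    name: str
    children: list = field(default_factory=list)


def serialize(node, void_elements):
    """Serialize a toy tree, emitting an end tag unless the name is void."""
    if isinstance(node, str):
        return node  # text node; escaping is omitted in this sketch
    start = f"<{node.name}>"
    if node.name in void_elements:
        return start  # void element: no children, no end tag
    inner = "".join(serialize(child, void_elements) for child in node.children)
    return f"{start}{inner}</{node.name}>"


print(serialize(Element("area"), VOID_WITHOUT_WBR))
print(serialize(Element("wbr"), VOID_WITHOUT_WBR))
print(serialize(Element("wbr"), VOID_WITH_WBR))
```

With the table that lacks wbr, the element comes out balanced, which matches the shape of the output in the session above. Adding the name to the table is enough to drop the end tag in this toy version.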
This is now addressed in html5lib: https://github.com/html5lib/html5lib-python/pull/395
Waiting on an html5lib release with this fix. Then we can update the vendored html5lib and test everything.
The <wbr> element is balanced by bleach.clean even though it is an empty element.
Using the list of empty tags from MDN:
The output includes <wbr></wbr> when it should just be <wbr> like the others.
keygen has the same problem, but that's deprecated so I'm not sure if it's worth including.
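A small stdlib-only helper can make this kind of check repeatable. The sketch below scans an HTML string for end tags that belong to void elements. The name balanced_void_tags and the VOID_TAGS set are my own, and the set is an assumption based on my reading of the HTML standard plus the deprecated keygen, so adjust it to whatever list you are testing against.

```python
import re

# Assumed list of void (empty) elements, plus the deprecated keygen;
# adjust to match the list you are testing against.
VOID_TAGS = frozenset({
    "area", "base", "br", "col", "embed", "hr", "img", "input",
    "link", "meta", "source", "track", "wbr", "keygen",
})

# Matches end tags such as </wbr> and captures the tag name.
END_TAG = re.compile(r"</([a-zA-Z][a-zA-Z0-9]*)\s*>")


def balanced_void_tags(html):
    """Return the sorted names of void elements that have an end tag in html."""
    found = {match.group(1).lower() for match in END_TAG.finditer(html)}
    return sorted(found & VOID_TAGS)


print(balanced_void_tags("<area><br><wbr></wbr><keygen></keygen>"))
```

Passing it the result of bleach.clean with every name in VOID_TAGS allowed through the tags argument should, on affected versions, report the elements that were balanced when they should not have been.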