Open Retr02332 opened 1 year ago
@gpshead
@pochmann Open that HTML in the browser. Then you will see that the script is executed.
So the parser does not detect the script tag here, so if a developer uses this library to parse a user's HTML to later scan against XSS for example, an attacker would end up bypassing the protections in this case.
@pochmann Not exactly: https://www.youtube.com/watch?v=H1TVk3HhL9E
The parser must parse the user input well. The golang net/html package validates well the input I put in this issue for example.
Background: This was reported to the PSRT first, we suggested it was better to file as a public issue and work on it here. Thanks for reporting and filing.
Note that animated GIFs or Videos are not a meaningful way to report software bugs. They are a major accessibility problem as they contain no text and thus cannot be used for anything. This is software, software is text. Please just create code and data snippets and check them in to a branch somewhere or post them as Gists if they cannot simply be included in a bug filing itself.
What's missing in this bug as written is explicit actual behavior and expected behavior of the code sample on the example document.
There are better html parsing libraries for Python than html.parser
from the stdlib. Using BeautifulSoup aka bs4
to explore some of them, we can see that even when it uses html.parser it is able to make sense of the example document. But that other parsers such as lxml
and html5lib
do a much better job:
# python -m pip install bs4 lxml html5lib
>>> pprint.pprint([(type(x), x) for x in
bs4.BeautifulSoup('<!--!> <h1 value="--!><script>alert(document.domain)</script>',
'html.parser').contents])
[(<class 'bs4.element.NavigableString'>, '<!--!> <h1 value="--!>'),
(<class 'bs4.element.Tag'>, <script>alert(document.domain)</script>)]
>>> pprint.pprint([(type(x), x) for x in
bs4.BeautifulSoup('<!--!> <h1 value="--!><script>alert(document.domain)</script>',
'html5lib').contents])
[(<class 'bs4.element.Comment'>, '!> <h1 value="'),
(<class 'bs4.element.Tag'>,
<html><head><script>alert(document.domain)</script></head><body></body></html>)]
>>> pprint.pprint([(type(x), x) for x in
bs4.BeautifulSoup('<!--!> <h1 value="--!><script>alert(document.domain)</script>',
'lxml').contents])
[(<class 'bs4.element.Comment'>, '!> <h1 value="'),
(<class 'bs4.element.Tag'>,
<html><head><script>alert(document.domain)</script></head></html>)]
The latter two product similar output to what you'll see in an actual web browsers DOM for this document and are thus what I'd suggest people actually use.
I'm not sure how bs4
coaxed html.parser
to get the script tag out vs using the html.parser.HTMLParser
callback API as the POC example code does. Regardless, kudos to bs4
which is the library I recommend everyone use for serious html needs.
Those looking into html.parser
for this issue should try to understand how it's HTMLParser
is supposed to work and vs what is really does in the face of an HTML Comment as defined in https://html.spec.whatwg.org/#comments. (and realize that html.parser
code was likely written long before WHATWG even existed)
1. I tested with only stattag <div>
:
>>> parser.feed('<div> this is div content ')
Result:
Start tag: div
Data: this is div content
2. But when I test with <style>
or <script>
tag I can't get the data
>>> parser.feed('<script> this is div content ')
Result:
Start tag: script
I wonder if this is a problem?
As we can see there (comment was added by me for clarity), the parser do not recognize closing comment tag.
According to whatwg
Comments must have the following format:
- The string "<!--".
- Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "", or "--!>", nor end with the string "<!-".
- The string "-->".
Pay attention to 3. "POC" uses --!>
instead of proper -->
, but browsers are more lenient due to the recommendation that I found in a comment of the GNOME's libxml2 that lxml potentially uses.
However, htmlParseErr
is invoked in the libxml2 on the incorrect comment closing tag.
My recommendation based on my findings: change Lib\_markupbase.py
:12 line from _commentclose = re.compile(r'--\s*>')
to _commentclose = re.compile(r'--!?>')
and potentially add some output in case if the !
sign is met in a closing tag.
(and realize that
html.parser
code was likely written long before WHATWG even existed)
FWIW a few years ago I updated it to handle HTML5 according to the specs, even though it's not 100% compliant (especially with some corner cases and invalid markup). html.parser
should follow the specs whenever possible and ideally behave like browsers do, without ever raising errors for invalid markup.
Description
The library
html.parser
allows an attacker to bypass any whitelist of HTML tags and attributes that seek to mitigate XSS. This is possible because the application does not correctly parse the HTML comments in the user input.Vulnerability
This vulnerability occurs because the application does not correctly parse the HTML comments in the user input.
Exploitation
In this scenario a developer parses the HTML entered by the user to validate it with an allowlist of tags and attributes. This is to prevent XSS attacks. In this case we see how we can bypass a security check of this type, thanks to the fact that the parser does not parse the HTML comments properly.
poc.py
Evidence of exploitation
Expected behavior
System Information
Linked PRs