python / cpython

The Python programming language
https://www.python.org
Other
63.54k stars 30.44k forks source link

XSS in html.parser library #102555

Open Retr02332 opened 1 year ago

Retr02332 commented 1 year ago

Description

The library html.parser allows an attacker to bypass any whitelist of HTML tags and attributes that seek to mitigate XSS. This is possible because the application does not correctly parse the HTML comments in the user input.

Vulnerability

This vulnerability occurs because the application does not correctly parse the HTML comments in the user input.

Exploitation

In this scenario a developer parses the HTML entered by the user to validate it with an allowlist of tags and attributes. This is to prevent XSS attacks. In this case we see how we can bypass a security check of this type, thanks to the fact that the parser does not parse the HTML comments properly.

poc.py

from html.parser import HTMLParser
from html.entities import name2codepoint

class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        print("Start tag:", tag)
        # Whitelist Tags
        print("Invalid tag:",tag != "h1")
        for attr in attrs:
            # Whitelist Attr
            print("attr:", attr)
            print("Invalid attr:",attr != "alt")

    def handle_endtag(self, tag):
        print("End tag  :", tag)

    def handle_data(self, data):
        print("Data     :", data)

    def handle_comment(self, data):
        print("Comment  :", data)

    def handle_entityref(self, name):
        c = chr(name2codepoint[name])
        print("Named ent:", c)

    def handle_charref(self, name):
        if name.startswith('x'):
            c = chr(int(name[1:], 16))
        else:
            c = chr(int(name))
        print("Num ent  :", c)

    def handle_decl(self, data):
        print("Decl     :", data)

parser = MyHTMLParser()
parser.feed('<!--!> <h1 value="--!><script>alert(document.domain)</script>')
# HTML is safe, we can proceed

Evidence of exploitation

python-exploit

Expected behavior

safe-python

System Information

Linked PRs

Retr02332 commented 1 year ago

@gpshead

Retr02332 commented 1 year ago

@pochmann Open that HTML in the browser. Then you will see that the script is executed.

So the parser does not detect the script tag here, so if a developer uses this library to parse a user's HTML to later scan against XSS for example, an attacker would end up bypassing the protections in this case.

Retr02332 commented 1 year ago

@pochmann Not exactly: https://www.youtube.com/watch?v=H1TVk3HhL9E

The parser must parse the user input well. The golang net/html package validates well the input I put in this issue for example.

gpshead commented 1 year ago

Background: This was reported to the PSRT first, we suggested it was better to file as a public issue and work on it here. Thanks for reporting and filing.

Note that animated GIFs or Videos are not a meaningful way to report software bugs. They are a major accessibility problem as they contain no text and thus cannot be used for anything. This is software, software is text. Please just create code and data snippets and check them in to a branch somewhere or post them as Gists if they cannot simply be included in a bug filing itself.

What's missing in this bug as written is explicit actual behavior and expected behavior of the code sample on the example document.

gpshead commented 1 year ago

There are better html parsing libraries for Python than html.parser from the stdlib. Using BeautifulSoup aka bs4 to explore some of them, we can see that even when it uses html.parser it is able to make sense of the example document. But that other parsers such as lxml and html5lib do a much better job:

# python -m pip install bs4 lxml html5lib
>>> pprint.pprint([(type(x), x) for x in
                   bs4.BeautifulSoup('<!--!> <h1 value="--!><script>alert(document.domain)</script>',
                                     'html.parser').contents])
[(<class 'bs4.element.NavigableString'>, '<!--!> <h1 value="--!>'),
 (<class 'bs4.element.Tag'>, <script>alert(document.domain)</script>)]

>>> pprint.pprint([(type(x), x) for x in
                   bs4.BeautifulSoup('<!--!> <h1 value="--!><script>alert(document.domain)</script>',
                                     'html5lib').contents])
[(<class 'bs4.element.Comment'>, '!> <h1 value="'),
 (<class 'bs4.element.Tag'>,
  <html><head><script>alert(document.domain)</script></head><body></body></html>)]

>>> pprint.pprint([(type(x), x) for x in
                   bs4.BeautifulSoup('<!--!> <h1 value="--!><script>alert(document.domain)</script>',
                                     'lxml').contents])
[(<class 'bs4.element.Comment'>, '!> <h1 value="'),
 (<class 'bs4.element.Tag'>,
  <html><head><script>alert(document.domain)</script></head></html>)]

The latter two product similar output to what you'll see in an actual web browsers DOM for this document and are thus what I'd suggest people actually use.

I'm not sure how bs4 coaxed html.parser to get the script tag out vs using the html.parser.HTMLParser callback API as the POC example code does. Regardless, kudos to bs4 which is the library I recommend everyone use for serious html needs.

Those looking into html.parser for this issue should try to understand how it's HTMLParser is supposed to work and vs what is really does in the face of an HTML Comment as defined in https://html.spec.whatwg.org/#comments. (and realize that html.parser code was likely written long before WHATWG even existed)

Hitachi-Pika commented 1 year ago

1. I tested with only stattag <div>: >>> parser.feed('<div> this is div content ')

Result: Start tag: div Data: this is div content

2. But when I test with <style> or <script> tag I can't get the data

>>> parser.feed('<script> this is div content ') Result: Start tag: script I wonder if this is a problem?

Privat33r-dev commented 7 months ago

image As we can see there (comment was added by me for clarity), the parser do not recognize closing comment tag.

According to whatwg

Comments must have the following format:

  1. The string "<!--".
  2. Optionally, text, with the additional restriction that the text must not start with the string ">", nor start with the string "->", nor contain the strings "", or "--!>", nor end with the string "<!-".
  3. The string "-->".

Pay attention to 3. "POC" uses --!> instead of proper -->, but browsers are more lenient due to the recommendation that I found in a comment of the GNOME's libxml2 that lxml potentially uses.

However, htmlParseErr is invoked in the libxml2 on the incorrect comment closing tag.

My recommendation based on my findings: change Lib\_markupbase.py:12 line from _commentclose = re.compile(r'--\s*>') to _commentclose = re.compile(r'--!?>') and potentially add some output in case if the ! sign is met in a closing tag.

ezio-melotti commented 7 months ago

(and realize that html.parser code was likely written long before WHATWG even existed)

FWIW a few years ago I updated it to handle HTML5 according to the specs, even though it's not 100% compliant (especially with some corner cases and invalid markup). html.parser should follow the specs whenever possible and ideally behave like browsers do, without ever raising errors for invalid markup.