html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

Fuzzing reveals a number of parse errors #568

Open leonardr opened 1 year ago

leonardr commented 1 year ago

I'm the lead developer of Beautiful Soup, which has html5lib as an optional dependency. Over the past couple of years I've gotten a number of notifications from Google's oss-fuzz project about unhandled exceptions that actually turned out to be problems in html5lib. There wasn't much I could do with these errors, but now that it looks like html5lib maintenance is picking up, I can pass them on to you. (Sorry. 😿)

I've incorporated the fuzz reports into the Beautiful Soup test suite, and the test cases themselves are here, but here's a general picture of what problems I see. In each case, I believe just parsing the bad markup is enough to trigger the error.

clusterfuzz-testcase-minimized-bs4_fuzzer-4999465949331456

Markup: b')<a><math><TR><a><mI><a><p><a>'

Error:

self = <html>, node = <p>, refNode = None

    def insertBefore(self, node, refNode):
>       index = self.element.index(refNode.element)
E       AttributeError: 'NoneType' object has no attribute 'element'

clusterfuzz-testcase-minimized-bs4_fuzzer-5843991618256896

Markup: b'-<math><sElect><mi><sElect><sElect>'

Error:

    def resetInsertionMode(self):
    ...
            # Check for conditions that should only happen in the innerHTML
            # case
            if nodeName in ("select", "colgroup", "head", "html"):
>               assert self.innerHTML
E               AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6241471367348224

Markup: b'ñ<table><svg><html>'

Error:

self = <html5lib.html5parser.getPhases.<locals>.InTablePhase object at 0x7f8f405ad440>

    def processEOF(self):
        if self.tree.openElements[-1].name != "html":
            self.parser.parseError("eof-in-table")
        else:
>           assert self.parser.innerHTML
E           AssertionError

clusterfuzz-testcase-minimized-bs4_fuzzer-6600557255327744

Markup: b'\t<TABLE><<!>;<!><<!>.<lec><th>i><a><mat\x00\x01<mi\x00a><math>><th><mI>chardeta\xff\xff\xff\xff<><th><mI><||||||||A<select><>qu?\xbemath><th><mie>qu'

Error:

self = <html5lib.html5parser.getPhases.<locals>.InTableBodyPhase object at 0x7f8f4184ce00>

    def clearStackToTableBodyContext(self):
        while self.tree.openElements[-1].name not in ("tbody", "tfoot",
                                                      "thead", "html"):
            # self.parser.parseError("unexpected-implied-end-tag-in-table",
            #  {"name": self.tree.openElements[-1].name})
            self.tree.openElements.pop()
        if self.tree.openElements[-1].name == "html":
>           assert self.parser.innerHTML
E           AssertionError

I was also recently sent a report of the problem that is already filed here as issue #557.
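
A minimal sketch for replaying the samples above, assuming Beautiful Soup and html5lib are both installed. The helper names `replay` and `parse_with_bs4` are made up for illustration, and only the two pure-ASCII samples are included because the exact bytes of the other two are not certain from this page. Whether each sample still raises depends on the installed versions.

```python
def replay(cases, parse):
    """Run parse() on every sample and map case name -> exception class name.

    A value of None means the sample parsed without raising.
    """
    results = {}
    for name, markup in cases.items():
        try:
            parse(markup)
        except Exception as exc:  # fuzz crashes can surface as any exception type
            results[name] = type(exc).__name__
        else:
            results[name] = None
    return results


def parse_with_bs4(markup):
    """Parse with Beautiful Soup's html5lib builder (imported lazily)."""
    from bs4 import BeautifulSoup
    return BeautifulSoup(markup, "html5lib")


# Markup copied from the reports above, keyed by the oss-fuzz test case number.
CASES = {
    "4999465949331456": b")<a><math><TR><a><mI><a><p><a>",
    "5843991618256896": b"-<math><sElect><mi><sElect><sElect>",
}
```

Calling `replay(CASES, parse_with_bs4)` returns a dict such as `{"4999465949331456": "AttributeError", ...}` when a sample still crashes. Passing the parser in as an argument makes it easy to swap in `html5lib.parse` and check whether a failure comes from html5lib itself or from the Beautiful Soup tree builder.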

leonardr commented 1 year ago

Another such error: clusterfuzz-testcase-minimized-bs4_fuzzer-6401239223762944

Markup: <math>\x10<select><mi><select><select>t

Same assert self.parser.innerHTML AssertionError as seen before. Going forward I'll probably only mention issues that look new.
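
One possible way to decide whether a crash "looks new" is to reduce each exception to a small signature and compare it with the ones already seen. This is only a sketch under that assumption, not how oss-fuzz or Beautiful Soup actually triage reports; `crash_signature` and `is_new` are hypothetical helpers that use nothing beyond the standard library.

```python
import traceback


def crash_signature(exc):
    """Return (exception class name, name of the innermost Python function)."""
    frames = traceback.extract_tb(exc.__traceback__)
    innermost = frames[-1].name if frames else None
    return (type(exc).__name__, innermost)


def is_new(exc, seen):
    """Record the signature in `seen`; return True only the first time it appears."""
    signature = crash_signature(exc)
    if signature in seen:
        return False
    seen.add(signature)
    return True
```

With this scheme, two samples that both fail on the same assertion in the same function collapse into one signature, while a crash with a different exception type or a different innermost function counts as new.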

leonardr commented 1 year ago

This one is different from the rest:

Markup: b'y<framesetboheadrb$al>t<table><><t><th><math><th>u<\x0ch><mi><thx><TR>ind><<meta><i<isind<i\xff\xff\xff\xffex><select><<tr>i=ut\x00\x007>'

Raises an IndexError:

  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 284, in parse
    self._parse(stream, False, None, *args, **kwargs)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 133, in _parse
    self.mainLoop()
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 240, in mainLoop
    new_token = phase.processStartTag(new_token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 469, in processStartTag
    return func(token)
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 2232, in startTagTableOther
    self.closeCell()
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 2220, in closeCell
    self.endTagTableCell(impliedTagToken("th"))
  File "/usr/lib/python3/dist-packages/html5lib/html5parser.py", line 2254, in endTagTableCell
    self.tree.clearActiveFormattingElements()
  File "/usr/lib/python3/dist-packages/html5lib/treebuilders/base.py", line 265, in clearActiveFormattingElements
    entry = self.activeFormattingElements.pop()
IndexError: pop from empty list