html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License

HTML5Parser has incorrect behavior #558

Closed by theRealProHacker 1 year ago

theRealProHacker commented 1 year ago

On version 1.1.

The docstring of HTMLParser.__init__ gives the example below. It would be great if it worked, but it fails when tree = "lxml":

def __init__(self, tree=None, strict=False, namespaceHTMLElements=True, debug=False):
        """
        :arg tree: a treebuilder class controlling the type of tree that will be
            returned. Built in treebuilders can be accessed through
            html5lib.treebuilders.getTreeBuilder(treeType)

        :arg strict: raise an exception when a parse error is encountered

        :arg namespaceHTMLElements: whether or not to namespace HTML elements

        :arg debug: whether or not to enable debug mode which logs things

        Example:

        >>> from html5lib.html5parser import HTMLParser
        >>> parser = HTMLParser()                     # generates parser with etree builder
        >>> parser = HTMLParser('lxml', strict=True)  # generates parser with lxml builder which is strict

        """

theRealProHacker commented 1 year ago

Update: this is a duplicate of #513, but I would still like to merge the PR and then close both this issue and #513.