html5lib / html5lib-python

Standards-compliant library for parsing and serializing HTML documents and fragments in Python
MIT License
1.13k stars 284 forks source link

lxml serializer goes "past" individual elements #338

Open cjerdonek opened 7 years ago

cjerdonek commented 7 years ago

I noticed that when serializing an individual element, the lxml serializer continues "past" the element. This causes inconsistent behavior between the "etree" and "lxml" tree types.

A minimal way to reproduce:

def render(html, tree_type):
    root = html5lib.parseFragment(html, namespaceHTMLElements=False,
                             treebuilder=tree_type)
    # Get the <b> element.
    element = root[0][0]
    print(html5lib.serialize(element, tree=tree_type))

html = '<div><b>foo</b></div>'

render(html, 'etree')
render(html, 'lxml')

Outputs:

<b>foo</b>
<b>foo</b></div>
cjerdonek commented 7 years ago

Also, lxml.etree and lxml.html do behave the right way:

tree_type = 'lxml'
html = '<div><b>foo</b></div>'
root = html5lib.parseFragment(html, namespaceHTMLElements=False,
                         treebuilder=tree_type)
# Get the <b> element.
element = root[0][0]
print(html5lib.serialize(element, tree=tree_type))
print(lxml.etree.tostring(element, encoding='unicode'))
print(lxml.html.tostring(element, encoding='unicode'))

Outputs:

<b>foo</b></div>
<b>foo</b>
<b>foo</b>
cjerdonek commented 7 years ago

I'm guessing this issue has to do with the fact that etree_lxml.TreeWalker.__init__() wraps tree in a different object before passing it to base.NonRecursiveTreeWalker.__init__():

class TreeWalker(base.NonRecursiveTreeWalker):
    def __init__(self, tree):
        # pylint:disable=redefined-variable-type
        if isinstance(tree, list):
            self.fragmentChildren = set(tree)
            tree = FragmentRoot(tree)
        else:
            self.fragmentChildren = set()
            tree = Root(tree)
        base.NonRecursiveTreeWalker.__init__(self, tree)
        self.filter = _ihatexml.InfosetFilter()

NonRecursiveTreeWalker does an identity check on self.tree to decide whether to terminate:

if self.tree is currentNode:
    currentNode = None
    break

So the wrapping could be tripping up that logic.

indivisible commented 7 years ago

You're right, removing the Root wrapper line solves this problem and some others too.

That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?

The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.

gsnedders commented 7 years ago

That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?

IIRC, it's meant to ensure that the DOCTYPE gets included when we have a whole tree. We probably should just special-case getting an lxml.etree._ElementTree object for that.

The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.

As far as I can tell (from what we have tests for), it works fine for whole trees and for whole fragments. What it doesn't work for is arbitrary elements.

cjerdonek commented 7 years ago

For fragments, I'm getting that lxml.etree elements do work when passing treeType 'lxml' to getTreeWalker() used for the serializer (which is the case this issue was originally about), but it doesn't work when passing 'etree'. So it's the opposite of this issue.

Here's code to reproduce:

def render(element, treeType, implementation=None):
    walker_cls = html5lib.getTreeWalker(treeType, implementation=implementation)
    walker = walker_cls(element)
    serializer = HTMLSerializer()
    print(serializer.render(walker))

elements = html5lib.parseFragment('<i>a</i><i>b</i>', treebuilder='lxml')

# Works.
render(elements, 'lxml')
# Both of these fail with the exception below.
try:
    render(elements, 'etree')
except Exception as err:
    print(err)
render(elements, 'etree', implementation=lxml.etree)

which outputs:

<i>a</i><i>b</i>
'list' object has no attribute 'getroot'
Traceback (most recent call last):
  File "test-html5lib.py", line 28, in <module>
    render(elements, 'etree', implementation=lxml.etree)
  File "test-html5lib.py", line 20, in render
    print(serializer.render(walker))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 323, in render
    return "".join(list(self.serialize(treewalker)))
  File "/.../python3.6/site-packages/html5lib/serializer.py", line 209, in serialize
    for token in treewalker:
  File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
    for previous, token, next in self.slider():
  File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 9, in slider
    for token in self.source:
  File "/.../python3.6/site-packages/html5lib/treewalkers/base.py", line 94, in __iter__
    details = self.getNodeDetails(currentNode)
  File "/.../python3.6/site-packages/html5lib/treewalkers/etree.py", line 48, in getNodeDetails
    node = node.getroot()
AttributeError: 'list' object has no attribute 'getroot'

Should I file a different issue?

indivisible commented 7 years ago

Actually now that I've looked at things a bit better wouldn't it be better to scrap the lxml specific stuff and replace it with the etree-API stuff? If I use TreeBuilder = html5lib.treebuilders.etree.getETreeBuilder(lxml.etree)['TreeBuilder'] and likewise for the TreeWalker everything starts working as expected.

Edit: actually better make that html5lib.treewalkers.getTreeWalker('etree', lxml.etree)

cjerdonek commented 7 years ago

Perhaps, but note that (at least as it stands now) the etree serializer has some of its own issues, like the one I mention in the comment before this one, and also this issue: https://github.com/html5lib/html5lib-python/issues/341

cjerdonek commented 7 years ago

Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker

indivisible commented 7 years ago

341 doesn't look like a bug: the Entity class in lxml.etree is an lxml exclusive extension to the API.

cjerdonek commented 7 years ago

Okay, well then all the more reason not to scrap the lxml serializer. Something should be capable of serializing lxml.etree.Entity elements.

gsnedders commented 7 years ago

Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker

AFAIK, it was mostly down to API differences around things like Entity and maybe getroot and docinfo and the like. It might simply be unneeded on the tree walker side, though is definitely needed on the tree builder side (where lxml enforces element names etc. match the XML productions, though maybe we have even crazier stuff to undo that which IMO we shouldn't?). If it's practical, I'm not opposed to simply killing the lxml tree walker.

cjerdonek commented 7 years ago

If it's practical, I'm not opposed to simply killing the lxml tree walker.

I think it's important to keep around for the reason I mentioned in my comment just above -- namely, otherwise, there would be no way to serialize trees containing lxml.etree.Entity objects, for example (since #341 was closed).

indivisible commented 7 years ago

I've been looking into it, and the problems I've ran into so far are:

  1. Comments: lxml does not support comments with '--' in them, and comments ending with '-'.

  2. doctype: the etree TreeBuilder tries to put the doctype in an Element. In lxml it's stored in DocInfo, and can't be added to tree after it's created.

Still, some small changes enable the etree builder/walker to work. Adding Entity support to it wouldn't be hard either.