Open cjerdonek opened 7 years ago
Also, lxml.etree
and lxml.html
do behave the right way:
tree_type = 'lxml'
html = '<div><b>foo</b></div>'
root = html5lib.parseFragment(html, namespaceHTMLElements=False,
treebuilder=tree_type)
# Get the <b> element.
element = root[0][0]
print(html5lib.serialize(element, tree=tree_type))
print(lxml.etree.tostring(element, encoding='unicode'))
print(lxml.html.tostring(element, encoding='unicode'))
Outputs:
<b>foo</b></div>
<b>foo</b>
<b>foo</b>
I'm guessing this issue has to do with the fact that etree_lxml.TreeWalker.__init__()
wraps tree
in a different object before passing it to base.NonRecursiveTreeWalker.__init__()
:
class TreeWalker(base.NonRecursiveTreeWalker):
def __init__(self, tree):
# pylint:disable=redefined-variable-type
if isinstance(tree, list):
self.fragmentChildren = set(tree)
tree = FragmentRoot(tree)
else:
self.fragmentChildren = set()
tree = Root(tree)
base.NonRecursiveTreeWalker.__init__(self, tree)
self.filter = _ihatexml.InfosetFilter()
NonRecursiveTreeWalker
does an identity check on self.tree
to decide whether to terminate:
if self.tree is currentNode:
currentNode = None
break
So the wrapping could be tripping up that logic.
You're right, removing the Root wrapper line solves this problem and some others too.
That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?
The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.
That wrapper also gives problems when you try to walk over an element that has siblings. What is it even supposed to accomplish?
IIRC, it's meant to ensure that the DOCTYPE gets included when we have a whole tree. We probably should just special-case getting an lxml.etree._ElementTree
object for that.
The walker is broken for list of elements, too (like returned by parseFragments when there are multiple root elements). In its current state the lxml walker is broken, and only works in some edge cases.
As far as I can tell (from what we have tests for), it works fine for whole trees and for whole fragments. What it doesn't work for is arbitrary elements.
For fragments, I'm getting that lxml.etree
elements do work when passing treeType 'lxml'
to getTreeWalker()
used for the serializer (which is the case this issue was originally about), but it doesn't work when passing 'etree'
. So it's the opposite of this issue.
Here's code to reproduce:
def render(element, treeType, implementation=None):
walker_cls = html5lib.getTreeWalker(treeType, implementation=implementation)
walker = walker_cls(element)
serializer = HTMLSerializer()
print(serializer.render(walker))
elements = html5lib.parseFragment('<i>a</i><i>b</i>', treebuilder='lxml')
# Works.
render(elements, 'lxml')
# Both of these fail with the exception below.
try:
render(elements, 'etree')
except Exception as err:
print(err)
render(elements, 'etree', implementation=lxml.etree)
which outputs:
<i>a</i><i>b</i>
'list' object has no attribute 'getroot'
Traceback (most recent call last):
File "test-html5lib.py", line 28, in <module>
render(elements, 'etree', implementation=lxml.etree)
File "test-html5lib.py", line 20, in render
print(serializer.render(walker))
File "/.../python3.6/site-packages/html5lib/serializer.py", line 323, in render
return "".join(list(self.serialize(treewalker)))
File "/.../python3.6/site-packages/html5lib/serializer.py", line 209, in serialize
for token in treewalker:
File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 18, in __iter__
for previous, token, next in self.slider():
File "/.../python3.6/site-packages/html5lib/filters/optionaltags.py", line 9, in slider
for token in self.source:
File "/.../python3.6/site-packages/html5lib/treewalkers/base.py", line 94, in __iter__
details = self.getNodeDetails(currentNode)
File "/.../python3.6/site-packages/html5lib/treewalkers/etree.py", line 48, in getNodeDetails
node = node.getroot()
AttributeError: 'list' object has no attribute 'getroot'
Should I file a different issue?
Actually now that I've looked at things a bit better wouldn't it be better to scrap the lxml specific stuff and replace it with the etree-API stuff? If I use TreeBuilder = html5lib.treebuilders.etree.getETreeBuilder(lxml.etree)['TreeBuilder']
and likewise for the TreeWalker everything starts working as expected.
Edit: actually better make that html5lib.treewalkers.getTreeWalker('etree', lxml.etree)
Perhaps, but note that (at least as it stands now) the etree serializer has some of its own issues, like the one I mention in the comment before this one, and also this issue: https://github.com/html5lib/html5lib-python/issues/341
Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker
Okay, well then all the more reason not to scrap the lxml serializer. Something should be capable of serializing lxml.etree.Entity elements.
Also, I think the original idea behind the lxml serializer is that it is optimized / faster. E.g. see here: https://html5lib.readthedocs.io/en/latest/html5lib.html#html5lib.__init__.getTreeWalker
AFAIK, it was mostly down to API differences around things like Entity
and maybe getroot
and docinfo
and the like. It might simply be unneeded on the tree walker side, though is definitely needed on the tree builder side (where lxml enforces element names etc. match the XML productions, though maybe we have even crazier stuff to undo that which IMO we shouldn't?). If it's practical, I'm not opposed to simply killing the lxml tree walker.
If it's practical, I'm not opposed to simply killing the lxml tree walker.
I think it's important to keep around for the reason I mentioned in my comment just above -- namely, otherwise, there would be no way to serialize trees containing lxml.etree.Entity objects, for example (since #341 was closed).
I've been looking into it, and the problems I've ran into so far are:
Comments: lxml does not support comments with '--' in them, and comments ending with '-'.
doctype: the etree TreeBuilder tries to put the doctype in an Element. In lxml it's stored in DocInfo, and can't be added to tree after it's created.
Still, some small changes enable the etree builder/walker to work. Adding Entity support to it wouldn't be hard either.
I noticed that when serializing an individual element, the lxml serializer continues "past" the element. This causes inconsistent behavior between the "etree" and "lxml" tree types.
A minimal way to reproduce:
Outputs: