Exception in lxml when parsing webapps r443 source

GoogleCodeExporter commented 9 years ago

Parsing the source file of http://svn.whatwg.org/webapps/ at r443 throws an 
exception in lxml:

Traceback (most recent call last):
  File "bug.py", line 3, in <module>
    html5lib.parse(open("source"), treebuilder="lxml")
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/html5parser.py", line 54, in parse
    return p.parse(doc, encoding=encoding)
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/html5parser.py", line 225, in parse
    parseMeta=parseMeta, useChardet=useChardet)
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/html5parser.py", line 115, in _parse
    self.mainLoop()
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/html5parser.py", line 182, in mainLoop
    new_token= self.phase.processSpaceCharacters(new_token)
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/html5parser.py", line 1002, in processSpaceCharacters
    self.tree.reconstructActiveFormattingElements()
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/treebuilders/_base.py", line 212, in reconstructActiveFormattingElements
    clone = entry.cloneNode() #Mainly to get a new copy of the attributes
  File "/usr/local/lib/python2.6/dist-packages/html5lib-0.95_dev-py2.6.egg/html5lib/treebuilders/etree.py", line 136, in cloneNode
    element.attributes[name] = value
  File "lxml.etree.pyx", line 1945, in lxml.etree._Attrib.__setitem__ (src/lxml/lxml.etree.c:42569)
  File "apihelpers.pxi", line 482, in lxml.etree._setAttributeValue (src/lxml/lxml.etree.c:13849)
  File "apihelpers.pxi", line 1417, in lxml.etree._attributeValidOrRaise (src/lxml/lxml.etree.c:21673)
ValueError: Invalid attribute name u'getcontext()<'

I'm attaching the source and a minimized test case, which looks like this:

<p><code x</code></p>

Here's the Python code to reproduce it:

import html5lib
html5lib.parse(open("source"), treebuilder="lxml")

Reproducible on the tip of the hg default branch.

Original issue reported on code.google.com by philip.j...@gmail.com on 6 Mar 2011 at 10:29

GoogleCodeExporter commented 9 years ago

In both Firefox 4 and the Opera Ragnarök build the following DOM is produced 
according to http://software.hixie.ch/utilities/js/live-dom-viewer/saved/871:

<!DOCTYPE HTML><html><head></head><body><p><code x<="" code=""></code></p><code x<="" code="">
</code></body></html>
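
The attribute names `x<` and `code` in that DOM follow from how a start tag is tokenized: `<` does not terminate an attribute name, while `/` ends the name without ending the tag. A toy sketch of just that behaviour, assuming valueless attributes only; `parse_start_tag` is a hypothetical helper, not part of html5lib or any browser:

```python
WHITESPACE = " \t\n\f\r"


def parse_start_tag(tag):
    """Toy model: split one start tag into (tag name, attribute names).

    Assumptions: the input starts with "<", contains one ">", and has no
    "=" values or quotes. Real tokenizers handle far more than this.
    """
    i = 1
    name = []
    # The tag name runs until whitespace, "/" or ">".
    while i < len(tag) and tag[i] not in WHITESPACE + "/>":
        name.append(tag[i])
        i += 1

    attrs = []
    current = []
    # Inside the tag, only whitespace, "/" and ">" end an attribute name,
    # so a stray "<" simply becomes part of the name.
    while i < len(tag) and tag[i] != ">":
        ch = tag[i]
        if ch in WHITESPACE or ch == "/":
            if current:
                attrs.append("".join(current))
                current = []
        else:
            current.append(ch)
        i += 1
    if current:
        attrs.append("".join(current))
    return "".join(name), attrs


print(parse_start_tag("<code x</code>"))
```

Under these assumptions the minimized input yields a `code` element with two attributes, `x<` and `code`, which matches what the browsers show.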

Original comment by philip.j...@gmail.com on 7 Mar 2011 at 7:48

GoogleCodeExporter commented 9 years ago

It seems like there's some coercion of the attribute names that should happen 
that isn't happening in this case, because this input:

<p><code x<=foo></code></p>

Produces this output:

<!DOCTYPE html><p><code xU0003C=foo></code></p>

So perhaps the coercion step is skipped for the original input?

Using this code:

#!/usr/bin/env python
import sys
import html5lib
from html5lib import treebuilders, treewalkers, serializer

doc = html5lib.parse(open("minimized.html"), treebuilder="lxml")

walker = treewalkers.getTreeWalker("lxml")

s = serializer.htmlserializer.HTMLSerializer()

for x in s.serialize(walker(doc)):
    sys.stdout.write(x)
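
The `xU0003C` in the output above is consistent with replacing each character that is not allowed in an XML name with `U` followed by its code point as five hex digits. A minimal sketch of that idea, assuming a simplified ASCII-only notion of valid names; `coerce_name` is a hypothetical name for illustration, not html5lib's API:

```python
import re

# Assumption: ASCII-only approximation of XML name rules. The real rules
# also allow many non-ASCII characters, and ":" has namespace meaning.
_INVALID_START = re.compile(r"[^A-Za-z_]")
_INVALID_REST = re.compile(r"[^A-Za-z0-9_.\-]")


def _escape(match):
    # "<" (code point 0x3C) becomes "U0003C".
    return "U%05X" % ord(match.group(0))


def coerce_name(name):
    """Hypothetical helper: make `name` usable as an XML attribute name."""
    if not name:
        return name
    first = _INVALID_START.sub(_escape, name[0])
    rest = _INVALID_REST.sub(_escape, name[1:])
    return first + rest


print(coerce_name("x<"))
print(coerce_name("getcontext()<"))
```

With this scheme the name from the traceback, `getcontext()<`, would also become a valid name, so the exception suggests that some code path hands the raw name to lxml.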

Original comment by philip.j...@gmail.com on 7 Mar 2011 at 8:01

GoogleCodeExporter commented 9 years ago

Original comment by philip.j...@gmail.com on 7 Mar 2011 at 6:51

Added labels: Port-Python, Type-Defect

GoogleCodeExporter commented 9 years ago

Phew, there's actually nothing magic about file input; the difference was a 
trailing linebreak in the file input. This is enough to reproduce:

import html5lib
html5lib.parse("<p><code x</code></p>\n", treebuilder="lxml")

Original comment by philip.j...@gmail.com on 8 Mar 2011 at 8:22

Changed state: Accepted

GoogleCodeExporter commented 9 years ago

James, I now have what seems to be a working fix. Could you review it for 
sanity? It's trivial, but I don't really understand the relationship between 
the Element class in treebuilders/etree.py and treebuilders/etree_lxml.py, 
beyond the fact that the latter inherits from the former.

The root cause is that the attributes on the underlying etree Element are 
coerced, but the attributes on the wrapping Element are not. The cloneNode was 
trying to copy the uncoerced attributes of the wrapper Element to an lxml 
Element.
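
The failure mode can be modelled without lxml. This is a toy sketch, not html5lib's code: `StrictElement` stands in for an lxml element that validates attribute names, `Wrapper` for the tree builder's wrapping Element, and `clone_fixed` shows only one possible way to avoid the error (the fix that was eventually committed may differ). All names here are hypothetical.

```python
import re

# Assumption: simplified ASCII-only rule for valid XML attribute names.
_VALID_NAME = re.compile(r"^[A-Za-z_][A-Za-z0-9_.\-]*$")


def _escape(match):
    return "U%05X" % ord(match.group(0))


def coerce(name):
    """Replace characters that are invalid in a name with UXXXXX escapes."""
    first = re.sub(r"[^A-Za-z_]", _escape, name[:1])
    rest = re.sub(r"[^A-Za-z0-9_.\-]", _escape, name[1:])
    return first + rest


class StrictElement:
    """Stand-in for an lxml element: rejects invalid attribute names."""

    def __init__(self, tag):
        self.tag = tag
        self.attrib = {}

    def set(self, name, value):
        if not _VALID_NAME.match(name):
            raise ValueError("Invalid attribute name %r" % name)
        self.attrib[name] = value


class Wrapper:
    """Stand-in for the wrapping Element: keeps the raw names itself."""

    def __init__(self, tag):
        self.element = StrictElement(tag)
        self.attributes = {}  # uncoerced names, as seen by the parser

    def set_attribute(self, name, value):
        self.attributes[name] = value
        self.element.set(coerce(name), value)  # underlying names are coerced

    def clone_buggy(self):
        clone = Wrapper(self.element.tag)
        for name, value in self.attributes.items():
            # Copies the uncoerced name straight onto the strict element.
            clone.element.set(name, value)
        return clone

    def clone_fixed(self):
        clone = Wrapper(self.element.tag)
        for name, value in self.attributes.items():
            # Goes through the same coercing path as normal insertion.
            clone.set_attribute(name, value)
        return clone
```

In this model, setting `x<` on a `Wrapper` succeeds, `clone_buggy()` raises `ValueError`, and `clone_fixed()` produces a clone whose underlying element carries `xU0003C`.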

Where would it be appropriate to add a test for this?

Original comment by philip.j...@gmail.com on 10 Mar 2011 at 7:55

Attachments:

issue178.patch

GoogleCodeExporter commented 9 years ago

Fixed in 
http://code.google.com/p/html5lib/source/detail?r=99e8af7f0c486da0f7ca7e570177d8f7b9f68ed4

The fix is a little different to the patch here.

Original comment by ja...@hoppipolla.co.uk on 10 Mar 2011 at 10:42

Changed state: Fixed

Exception in lxml when parsing webapps r443 source #178