kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0
678 stars 33 forks source link

as_libxml.c: Set xmlDoc to be of "HTML" type to fix HTML-specific code in libxml2 #11

Closed Balletie closed 6 years ago

Balletie commented 6 years ago

Some code paths in libxml2 check whether the passed in document is an HTML document. If parsing HTML with lxml/libxml2, then the xmlDoc has (doc->type == XML_HTML_DOCUMENT_NODE) instead of XML_DOCUMENT_NODE. Some libxml2 code checks for this property and chooses a different execution path (e.g. see 1).

My specific problem was that when using lxml's html.tostring method, attribute values containing unicode characters were converted to hexadecimal character references of the form '&#nnnn;' (see 2, in the else block). This works fine for anyone viewing the rendered HTML in their browser. But I'd like to preserve the UTF-8 characters, as is the behavior for lxml, so that I can e.g. edit the HTML file in the future.

Consider the following proof of concept:

>>> import html5_parser
>>> from lxml import html
>>> root1 = html5_parser.parse('<meta name="description" content="Sóme UTF-8 chāracterṣ" />')
>>> root2 = html.fromstring('<meta name="description" content="Sóme UTF-8 chāracterṣ" />')
>>> html.tostring(root1, encoding='unicode')
'<html><head><meta name="description" content="S&#xF3;me UTF-8 ch&#x101;racter&#x1E63;"></head><body></body></html>'
>>> html.tostring(root2, encoding='unicode')
'<html><head><meta name="description" content="Sóme UTF-8 chāracterṣ"></head></html>'

(Ignore the absence of the <body/> element in root2 parsed by lxml)

After my fix, the output is as follows:

>>> html.tostring(root1, encoding='unicode')
'<html><head><meta name="description" content="Sóme UTF-8 chāracterṣ"></head><body></body></html>'
Balletie commented 6 years ago

No idea why Travis fails, but the tests pass on my local machine.

kovidgoyal commented 6 years ago

I'm not sure this is a good idea. IIRC this document type is meant for plain HTML, not XHTML (see for example https://github.com/GNOME/libxml2/blob/790c230fcfbc3d9d5cc272479dc6d0d8765fb9d7/xmlsave.c#L900). Since html5-parser can produce XHTML as well, this should not be done unconditionally. At the least it should be conditional on the value of namespace_elements.

If you want to use tostring() use etree.tostring() rather than html.tostring() then you wont have the entity issue. See #7 (although that was for unicode in element text not attributes, I dont recall if etree.tostring uses entities in attributes as well for non-ascii characters. If it does it is really a bug in libxml2 and should be fixed there. As far as I know non-ascii characters are perfectly legal in XML attributes https://stackoverflow.com/questions/866706/which-characters-are-invalid-unless-encoded-in-an-xml-attribute).

Balletie commented 6 years ago

Sadly etree.tostring doesn't work either :(. As you can see in the below example, only attribute values get converted.

>>> import html5_parser
>>> from lxml import etree
>>> root1 = html5_parser.parse('<p class="test" data-content="Sóme UTF-8 chāracterṣ">Sóme UTF-8 chāracterṣ</p>')
>>> etree.tostring(root1, encoding='unicode')
'<html><head/><body><p class="test" data-content="S&#xF3;me UTF-8 ch&#x101;racter&#x1E63;">Sóme UTF-8 chāracterṣ</p></body></html>'

I pushed the fix in this way because it was the quickest way to get it to work, but I didn't take into account XHTML.

Another way we could get it to work, I think, is by setting the encoding on the document object. As you can see below, the encoding doesn't get set for the tree parsed by html5-parser:

>>> root1.getroottree().docinfo.encoding
>>> root2.getroottree().docinfo.encoding
'UTF-16LE'

If you look at this branch condition, you can see that it either checks the encoding is not NULL, or if it's an HTML document. Although I don't know if it reaches that function when using etree.tostring(), because there's a similar xml escape function here in xmlsave.c.

Balletie commented 6 years ago

I noticed some odd behavior while trying the same example above with lxml's parser:

>>> root2 = html.fromstring('<html><head/><body><p class="test" data-content="Sóme UTF-8 chāracterṣ">Sóme UTF-8 chāracterṣ</p></body></html>')
>>> etree.tostring(root2, encoding='unicode')
'<html><head/><body><p class="test" data-content="Sóme UTF-8 chāracterṣ">Sóme UTF-8 chāracterṣ</p></body></html>'

Above works fine, but if I want to print a single element instead of a complete HTML document, I get the same issue as I do with the three parsed by html5-parser:

>>> etree.tostring(root2.find('.//p'), encoding='unicode')
'<p class="test" data-content="S&#xF3;me UTF-8 ch&#x101;racter&#x1E63;">Sóme UTF-8 chāracterṣ</p>

So you might be right that strictly speaking this isn't an issue in your library... Although maybe the encoding not being set (see previous comment) might be interesting to be investigated.

kovidgoyal commented 6 years ago

I dont see any issue with setting the encoding to UTF-8 by default, technically it should be the encoding of the input document, but in practice I dont think it matters as encoding is only important when serializing and can be overridden in the tostring() method.

Balletie commented 6 years ago

Alright I'll open a new PR with setting the encoding. Wouldn't it be technically correct also, since you always convert your input to UTF-8 internally (1)?