kovidgoyal / html5-parser

Fast C based HTML 5 parsing for python
Apache License 2.0

Non-ASCII contents are escaped when serializing with et.tostring(method='html') #7

Closed ciscorn closed 6 years ago

ciscorn commented 6 years ago

code:

import lxml.etree as et
import html5_parser
import html5lib

t1 = html5lib.parse("<p>ああああ", treebuilder="lxml", namespaceHTMLElements=False)
t2 = html5_parser.parse("<p>ああああ", treebuilder="lxml")

print('html5lib')
print(et.tostring(t1, encoding='unicode', method="html"))
print()
print('html5_parser')
print(et.tostring(t2, encoding='unicode', method="html"))

the result is:

html5lib
<html>
<head></head>
<body><p>ああああ</p></body>
</html>

html5_parser
<html>
<head></head>
<body><p>&#x3042;&#x3042;&#x3042;&#x3042;</p></body>
</html>

kovidgoyal commented 6 years ago

Works for me, with python3

python3 -c "from lxml.etree import tostring; from html5_parser import parse; print(tostring(parse('<p>ああああ'), encoding='unicode'))"
<html><head/><body><p>ああああ</p></body></html>

and with python2

python2 -c "from lxml.etree import tostring; from html5_parser import parse; print(tostring(parse('<p>ああああ'), encoding='unicode'))"
<html><head/><body><p>ああああ</p></body></html>

kovidgoyal commented 6 years ago

And this is with lxml 4.1

ciscorn commented 6 years ago

@kovidgoyal This problem occurs only when giving the method='html' parameter.

I tested on macOS and Arch Linux with Python 3.6 / lxml 4.1.

kovidgoyal commented 6 years ago

Don't use method='html'; it is slower and has various bugs.

kovidgoyal commented 6 years ago

To expand on that, method='html' uses the htmlNodeDumpFormatOutput function from libxml2. This function either expects the text it receives to already have unsafe characters escaped, or it escapes them itself. Unfortunately, its definition of unsafe is overbroad: it includes all non-ASCII characters. Therefore, to achieve the same result as html5lib, html5-parser would have to escape XML-unsafe characters itself, which is a huge and unnecessary performance hit.
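
For anyone who still has to use method='html', one possible workaround is to post-process the serialized string and turn the numeric character references for non-ASCII code points back into the characters themselves. The sketch below is only an illustration using the standard library: the helper name unescape_non_ascii_refs is hypothetical, it is not part of html5-parser or lxml, and it assumes the serialized output contains no literal text such as script content where a sequence like &#x3042; must be kept verbatim.

```python
import re

# Matches hexadecimal or decimal numeric character references,
# e.g. &#x3042; or &#12354;
_CHAR_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def unescape_non_ascii_refs(markup: str) -> str:
    """Hypothetical helper: replace numeric character references for
    non-ASCII code points with the characters themselves.

    References to ASCII code points (such as &#60; for '<') and named
    entities (such as &amp;) are left untouched, so the markup keeps
    its meaning.
    """

    def _replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        # Leave ASCII, surrogates, and out-of-range values as they are.
        if code_point < 0x80 or code_point > 0x10FFFF:
            return match.group(0)
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _CHAR_REF.sub(_replace, markup)


escaped = "<body><p>&#x3042;&#x3042;&#x3042;&#x3042;</p></body>"
print(unescape_non_ascii_refs(escaped))
# prints: <body><p>ああああ</p></body>
```

This adds a second pass over the output, so it does not avoid the cost of method='html'; serializing without method='html', as in the one-liners above, remains the simpler route.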