When applied to a document tree returned by html5-parser, etree.tostring(method='html', encoding='...') converts all non-ASCII characters to HTML entities, even if they are representable in the desired encoding. This doesn't happen with a document tree produced by lxml.etree.HTML, and it also doesn't happen when using method='xml'. For example:
Python 3.6.5+ (default, Jun 8 2018, 21:55:12)
IPython 5.5.0 -- An enhanced Interactive Python.
In [1]: import lxml.etree
In [2]: import html5_parser
In [3]: d1 = lxml.etree.HTML("<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body></body></html>")
In [4]: d2 = html5_parser.parse("<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body></body></html>")
In [5]: lxml.etree.tostring(d1, encoding='unicode', method='html')
Out[5]: '<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body></body></html>'
In [6]: lxml.etree.tostring(d2, encoding='unicode', method='html')
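Out[6]: '<html><head><title>&#3626;&#3635;&#3609;&#3633;&#3585;&#3591;&#3634;&#3609;&#3588;&#3603;&#3632;&#3585;&#3619;&#3619;&#3617;&#3585;&#3634;&#3619;&#3585;&#3634;&#3619;&#3648;&#3621;&#3639;&#3629;&#3585;&#3605;&#3633;&#3657;&#3591;</title></head><body></body></html>'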
In [7]: lxml.etree.tostring(d2, encoding='unicode', method='xml')
Out[7]: '<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body/></html>'
The desired behavior is for Out[6] to be identical to Out[5]: non-ASCII characters should be emitted as-is whenever the requested encoding can represent them.
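To check for this behavior automatically, a small helper along the following lines might be useful. It is only a sketch: `needless_entities` is a hypothetical name, not part of lxml or html5-parser, and it uses nothing beyond the standard library. It scans already-serialized markup for numeric character references to non-ASCII characters that the target encoding could have represented directly.

```python
import re

# Matches decimal (&#3626;) and hexadecimal (&#xE2A;) numeric character references.
_NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9a-fA-F]+)|([0-9]+));")


def needless_entities(serialized, encoding="utf-8"):
    """Return the non-ASCII characters that appear in `serialized` as numeric
    character references even though `encoding` can represent them directly.

    Hypothetical helper for illustration; not an lxml or html5-parser API.
    """
    found = []
    for match in _NUMERIC_REF.finditer(serialized):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if code_point < 128:
            # ASCII references such as &#38; can be legitimate escapes.
            continue
        try:
            char = chr(code_point)
            char.encode(encoding)
        except (ValueError, OverflowError):
            # Not a valid code point, or not representable in `encoding`,
            # so the reference was necessary (or is simply malformed).
            continue
        found.append(char)
    return found
```

Assuming lxml and html5-parser are installed, passing the string from In [6] to `needless_entities` would be expected to return an empty list once the output matches In [5].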