kovidgoyal / html5-parser

Fast C-based HTML 5 parsing for Python
Apache License 2.0

etree.tostring(method='html') unnecessarily escapes all non-ASCII characters #13

Closed: zackw closed this issue 6 years ago

zackw commented 6 years ago

When applied to a document tree returned by html5-parser, etree.tostring(method='html', encoding='...') converts every non-ASCII character to a numeric character reference (such as &#xE2A;), even when the character is representable in the requested encoding. This doesn't happen with a document tree produced by lxml.etree.HTML, and it also doesn't happen with method='xml'. For example:

Python 3.6.5+ (default, Jun  8 2018, 21:55:12) 
IPython 5.5.0 -- An enhanced Interactive Python.

In [1]: import lxml.etree
In [2]: import html5_parser
In [3]: d1 = lxml.etree.HTML("<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body></body></html>")
In [4]: d2 = html5_parser.parse("<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body></body></html>")
In [5]: lxml.etree.tostring(d1, encoding='unicode', method='html')
Out[5]: '<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body></body></html>'

In [6]: lxml.etree.tostring(d2, encoding='unicode', method='html')
Out[6]: '<html><head><title>&#xE2A;&#xE33;&#xE19;&#xE31;&#xE01;&#xE07;&#xE32;&#xE19;&#xE04;&#xE13;&#xE30;&#xE01;&#xE23;&#xE23;&#xE21;&#xE01;&#xE32;&#xE23;&#xE01;&#xE32;&#xE23;&#xE40;&#xE25;&#xE37;&#xE2D;&#xE01;&#xE15;&#xE31;&#xE49;&#xE07;</title></head><body></body></html>'

In [7]: lxml.etree.tostring(d2, encoding='unicode', method='xml')
Out[7]: '<html><head><title>สำนักงานคณะกรรมการการเลือกตั้ง</title></head><body/></html>'

Desired behavior would be for Out[5] and Out[6] to be the same.
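
Until the serializer behaves that way, one possible stopgap is to post-process the serialized string. The sketch below is not part of html5-parser or lxml; the helper name unescape_non_ascii is hypothetical, and it assumes the output only contains hexadecimal references of the form shown in Out[6]. It decodes references above U+007F and leaves ASCII references alone, since decoding something like &#x3C; could change the markup.

```python
import re

# Matches hexadecimal numeric character references such as "&#xE2A;".
_HEX_REF = re.compile(r"&#x([0-9A-Fa-f]+);")


def unescape_non_ascii(markup: str) -> str:
    """Replace hex character references above U+007F with literal characters.

    Hypothetical helper for illustration; not an lxml or html5-parser API.
    """
    def repl(match: "re.Match[str]") -> str:
        code_point = int(match.group(1), 16)
        # Keep ASCII references (e.g. "&#x3C;" for "<"), surrogates, and
        # out-of-range values as they are, so the markup is not altered.
        if code_point < 0x80 or code_point > 0x10FFFF:
            return match.group(0)
        if 0xD800 <= code_point <= 0xDFFF:
            return match.group(0)
        return chr(code_point)

    return _HEX_REF.sub(repl, markup)


# First five characters of the title from Out[6].
escaped = "<title>&#xE2A;&#xE33;&#xE19;&#xE31;&#xE01;</title>"
print(unescape_non_ascii(escaped))
```

This only rewrites the string after serialization, so it does not explain or fix why the two trees serialize differently.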

kovidgoyal commented 6 years ago

Duplicate of #7.