Closed: ciscorn closed this issue 6 years ago
Works for me, with python3
python3 -c "from lxml.etree import tostring; from html5_parser import parse; print(tostring(parse('<p>ああああ'), encoding='unicode'))"
<html><head/><body><p>ああああ</p></body></html>
and with python2
python2 -c "from lxml.etree import tostring; from html5_parser import parse; print(tostring(parse('<p>ああああ'), encoding='unicode'))"
<html><head/><body><p>ああああ</p></body></html>
And this is with lxml 4.1
@kovidgoyal This problem occurs only when the method='html' parameter is passed.
I tested on macOS and Arch Linux with Python 3.6 / lxml 4.1.
Don't use method='html'; it is slower and has various bugs.
To expand on that, method='html' uses the htmlNodeDumpFormatOutput function from libxml. This function either expects the text it receives to already have unsafe characters escaped, or it escapes them itself. Unfortunately, its definition of unsafe is overbroad: it includes all non-ASCII characters. Therefore, to achieve the same result as html5lib, html5-parser would have to escape XML-unsafe characters itself, which is a huge and unnecessary performance hit.
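To make the difference between the two escaping policies concrete, here is a rough standard-library sketch. It is not libxml's code and does not reproduce the exact output of method='html'; the helper names `escape_minimal` and `escape_all_non_ascii` are made up for illustration. It only contrasts escaping the markup-significant characters with also treating every non-ASCII character as unsafe.

```python
from html import escape


def escape_minimal(text: str) -> str:
    """Escape only the markup-significant characters: &, < and >."""
    return escape(text, quote=False)


def escape_all_non_ascii(text: str) -> str:
    """Additionally replace every non-ASCII character with a numeric
    character reference (an illustration of an overbroad 'unsafe' set)."""
    minimal = escape(text, quote=False)
    return minimal.encode("ascii", "xmlcharrefreplace").decode("ascii")


sample = "ああ & <b>"
print(escape_minimal(sample))        # ああ &amp; &lt;b&gt;
print(escape_all_non_ascii(sample))  # &#12354;&#12354; &amp; &lt;b&gt;
```

With the minimal policy the Japanese text passes through untouched, which is what the default serialization in the commands above produces when encoding='unicode' is used.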
code:
the result is: