Open immerrr opened 7 years ago
Thanks for reporting @immerrr !
It does not look straightforward to fix, though. html5lib does the replacement clearly, while with libxml2's HTMLParser it seems this case is not handled.
Maybe one could use the parser target interface to intercept the data and replace the chars, but I don't know about the processing penalty. Sample code:
>>> import string
>>>
>>> import lxml.etree
>>> from html5lib.constants import replacementCharacters
>>>
>>> table = {unichr(i): r for i, r in replacementCharacters.items()}
>>>
>>> def charref_replace(s):
...     out = u''
...     for c in s:
...         if c in table:
...             out += table[c]
...         else:
...             out += c
...     return out
...
>>> class ReservedReplacementTarget(lxml.etree.TreeBuilder):
...     def data(self, data):
...         return super(ReservedReplacementTarget, self).data(charref_replace(data))
...
>>> parser = lxml.etree.HTMLParser(target = ReservedReplacementTarget())
>>> print(lxml.etree.fromstring('<p>hello, … world!</p>', parser=parser).xpath('//p')[0].text)
hello, … world!
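On the processing-penalty question, a possible variant of `charref_replace` is sketched below for Python 3. It is only a sketch under a few assumptions: it uses `str.translate` instead of the per-character loop, and it builds the table from Python's own `cp1252` codec rather than from html5lib's `replacementCharacters`, so the two tables may not agree for every code point (the bytes the codec leaves undefined are simply passed through here). The names `build_cp1252_table` and `CP1252_TABLE` are made up for this example, and no timing has been measured.

```python
# Sketch (Python 3, stdlib only): replace C1 control characters
# (U+0080..U+009F) with the characters Windows-1252 assigns to the
# same byte values, using a single str.translate() call.

def build_cp1252_table():
    """Return a {code point: replacement string} dict for str.translate."""
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Assumption: bytes that Python's cp1252 codec leaves
            # undefined are kept unchanged instead of being replaced.
            continue
    return table


CP1252_TABLE = build_cp1252_table()


def charref_replace(text):
    """Translate reserved C1 characters in text; other characters pass through."""
    return text.translate(CP1252_TABLE)


# U+0085 is what a numeric reference to code point 133 yields when taken literally.
print(charref_replace("hello, \x85 world!"))
```

A target class like `ReservedReplacementTarget` above could call this function from its `data()` method in the same way; whether that is fast enough for real documents would still need to be checked.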
The entities marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas it should probably strive for more compatibility with browsers, which interpret them according to CP1252. A quick example:
whereas modern browsers usually show it as an ellipsis (…):