Work around incorrect extraction of "reserved" HTML entities

Thanks for reporting, @immerrr! It does not look straightforward to fix, though. html5lib clearly does the replacement, while libxml2's HTMLParser does not seem to handle this case.

Maybe one could use the parser target interface to intercept the text data and replace the characters, but I don't know what the processing penalty would be. Sample code:

>>> import lxml.etree
>>> from html5lib.constants import replacementCharacters
>>> 
>>> # Map each "reserved" code point to the character html5lib substitutes for it.
>>> table = {chr(i): r for i, r in replacementCharacters.items()}
>>> 
>>> def charref_replace(s):
...     return ''.join(table.get(c, c) for c in s)
... 
>>> class ReservedReplacementTarget(lxml.etree.TreeBuilder):
...     # Rewrite each text chunk before it is added to the tree.
...     def data(self, data):
...         return super().data(charref_replace(data))
... 
>>> parser = lxml.etree.HTMLParser(target=ReservedReplacementTarget())
>>> print(lxml.etree.fromstring('<p>hello, &#133; world!</p>', parser=parser).xpath('//p')[0].text)
hello, … world!
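
If the per-character loop turns out to be the costly part, `str.translate` might be cheaper, since it does the lookup in C. Below is a minimal sketch of that idea, not a measured result. It only needs the standard library: `build_c1_table` and `C1_TABLE` are hypothetical names, and the table is a stand-in built from Python's `cp1252` codec rather than html5lib's `replacementCharacters` (which also covers a few code points outside U+0080..U+009F).

```python
def build_c1_table():
    """Hypothetical stand-in for html5lib's replacementCharacters.

    Maps the C1 control code points (U+0080..U+009F) to the characters
    that Windows-1252 assigns to the same byte values.
    """
    table = {}
    for code in range(0x80, 0xA0):
        try:
            table[code] = bytes([code]).decode("cp1252")
        except UnicodeDecodeError:
            # Bytes that Python's cp1252 codec leaves undefined are
            # skipped, so those code points pass through unchanged.
            pass
    return table


# str.translate expects a mapping keyed by integer code points.
C1_TABLE = build_c1_table()


def charref_replace(text):
    # One pass over the string; characters not in the table are kept as-is.
    return text.translate(C1_TABLE)
```

`charref_replace` has the same signature as in the sample above, so it could be dropped into `ReservedReplacementTarget.data` unchanged. With html5lib installed, `text.translate(replacementCharacters)` should work directly as well, assuming that dict is keyed by integer code points as the sample's use of `chr(i)` implies.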

scrapy/parsel#76