scrapy / parsel

Parsel lets you extract data from XML/HTML documents using XPath or CSS selectors
BSD 3-Clause "New" or "Revised" License

Work around incorrect extraction of "reserved" HTML entities #76

Open immerrr opened 7 years ago

immerrr commented 7 years ago

The character references marked as reserved here (scroll down to see the list) are extracted literally by lxml, whereas it should probably aim for compatibility with browsers, which interpret them according to CP1252.

A quick example:


In [13]: etree.fromstring('<p>&#133;</p>').text
Out[13]: u'\x85'

whereas modern browsers usually show it as an ellipsis:

In [5]: u'\u2026'
Out[5]: '…'
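The CP1252 interpretation can be checked with the standard library alone. The following is a minimal sketch, not anything lxml or parsel provides: `cp1252_char` is a hypothetical helper name, and it assumes that code points in the 0x80-0x9F range which CP1252 leaves undefined should simply pass through unchanged.

```python
def cp1252_char(codepoint):
    """Return the character a browser would likely show for a numeric reference.

    Hypothetical helper: code points in 0x80-0x9F are reinterpreted as
    CP1252 bytes; everything else maps to itself.
    """
    if 0x80 <= codepoint <= 0x9F:
        try:
            return bytes([codepoint]).decode("cp1252")
        except UnicodeDecodeError:
            # A few bytes (e.g. 0x81) are undefined in Python's cp1252 codec;
            # assumption: keep the original control character in that case.
            return chr(codepoint)
    return chr(codepoint)


print(repr(cp1252_char(133)))  # the code point behind &#133;
```

With this mapping, `&#133;` (0x85) comes out as U+2026 rather than the C1 control character `\x85`.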

redapple commented 7 years ago

Thanks for reporting, @immerrr! It does not look straightforward to fix, though. html5lib does the replacement explicitly, while libxml2's HTMLParser does not seem to handle this case.

Maybe one could use the parser target interface to intercept the data and replace the characters, but I don't know what the processing penalty would be. Sample code:

>>> import lxml.etree
>>> from html5lib.constants import replacementCharacters
>>> 
>>> # Map each reserved code point (an int key) to its replacement string.
>>> table = {chr(i): r for i, r in replacementCharacters.items()}
>>> 
>>> def charref_replace(s):
...     # Substitute characters found in the table; leave the rest untouched.
...     return ''.join(table.get(c, c) for c in s)
... 
>>> class ReservedReplacementTarget(lxml.etree.TreeBuilder):
...     def data(self, data):
...         return super(ReservedReplacementTarget, self).data(charref_replace(data))
... 
>>> parser = lxml.etree.HTMLParser(target = ReservedReplacementTarget())
>>> print(lxml.etree.fromstring('<p>hello, &#133; world!</p>', parser=parser).xpath('//p')[0].text)
hello, … world!
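On the processing-penalty question, one option might be to do the substitution with `str.translate` and a precomputed table instead of a per-character Python loop. This is only a sketch under stated assumptions: `build_c1_table` and `charref_replace_fast` are hypothetical names, the table is built from Python's `cp1252` codec rather than html5lib's `replacementCharacters`, and it covers only the 0x80-0x9F range. No timing was measured here.

```python
def build_c1_table():
    """Build a translate() table for code points 0x80-0x9F (hypothetical helper)."""
    table = {}
    for cp in range(0x80, 0xA0):
        try:
            table[cp] = bytes([cp]).decode("cp1252")
        except UnicodeDecodeError:
            # Undefined in CP1252: no entry, so translate() leaves it as is.
            continue
    return table


# Built once at import time so each call is a single translate().
C1_TABLE = build_c1_table()


def charref_replace_fast(s):
    """Replace C1 control characters with their CP1252 equivalents."""
    return s.translate(C1_TABLE)


print(charref_replace_fast("hello, \x85 world!"))
```

The function could be dropped into the `data` method of the target class above in place of `charref_replace`; whether that is fast enough in practice would need benchmarking.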