rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License
1.11k stars, 68 forks

text() always decodes HTML entities #22

Open kiwijam opened 4 years ago

kiwijam commented 4 years ago

As far as I can tell, there's currently no easy way to extract text while preserving the HTML entity encoding.

Having that option would be handy!

from selectolax.parser import HTMLParser
from html import escape

html = HTMLParser('<div>&#x3C;test&#x3E;</div>')
print(html.text())
print(escape(html.text()))
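
For comparison, here is a minimal sketch of a possible workaround outside selectolax, using the standard library's html.parser with convert_charrefs=False so that character references are reported separately instead of being decoded. The names RawTextExtractor and raw_text are made up for illustration and are not part of any library. It is an assumption-laden sketch, not a drop-in replacement: it will be much slower than selectolax, and it always rebuilds references with a trailing semicolon even if the source omitted it.

```python
from html.parser import HTMLParser as StdHTMLParser


class RawTextExtractor(StdHTMLParser):
    """Hypothetical helper: collect text while leaving entities undecoded."""

    def __init__(self):
        # convert_charrefs=False makes the parser call handle_entityref() and
        # handle_charref() instead of decoding references into handle_data().
        super().__init__(convert_charrefs=False)
        self.parts = []

    def handle_data(self, data):
        # Plain text between tags and references.
        self.parts.append(data)

    def handle_entityref(self, name):
        # Named reference such as &lt; arrives as "lt"; rebuild its spelling.
        self.parts.append(f"&{name};")

    def handle_charref(self, name):
        # Numeric reference such as &#x3C; arrives as "x3C".
        self.parts.append(f"&#{name};")


def raw_text(markup):
    """Return the text content of markup with entity encoding preserved."""
    extractor = RawTextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)


print(raw_text('<div>&#x3C;test&#x3E;</div>'))  # &#x3C;test&#x3E;
```

Unlike escape(html.text()), this keeps the original spelling of each reference (hex, decimal, or named) rather than re-encoding the decoded text.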

rushter commented 4 years ago

I don't think I can control this, since Modest performs some preprocessing, but I could be wrong.

lexborisov commented 4 years ago

@kiwijam @rushter

In Modest we have buffer positions for attributes in tokens. You can use them to get the raw data.

rushter commented 4 years ago

Added limited support for this in 0.2.7.

>>> html_parser = HTMLParser('<div>&#x3C;test&#x3E;</div>')
>>> selector = html_parser.css_first('div')
>>> selector.child.html
'&lt;test&gt;'
>>> selector.child.raw_value
b'&#x3C;test&#x3E;'

This is limited to text nodes only for now.

ichux commented 4 years ago

Thanks for the work you've done. How can I help maintain the library? I'd like to help so that more features can be added.

rushter commented 4 years ago

Well, it's open source. You are welcome to propose new features or improve existing ones.

You could extend the new raw_value feature to support arbitrary nodes. That's a pretty easy task, but you will need to be familiar with C and the Modest library.