rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Segmentation fault with unwrap_tags #32

Closed: satchamo closed this issue 3 years ago

satchamo commented 3 years ago

First, thanks for writing and maintaining this speedy library. It makes a huge difference when parsing thousands of documents compared to other parsers.

Anyway, while attempting to strip the data tag from some content, I noticed that the library seems to choke on this specific tag. Here's some code to reproduce it:

from selectolax.parser import HTMLParser
html = """test"""
tree = HTMLParser(html)
tree.unwrap_tags(["data"])

Output: Segmentation fault (core dumped)

I'm running selectolax==0.2.6 on Python 3.6.9

rushter commented 3 years ago

I've fixed the problem in 0.2.10. If you don't want to keep the content of the data tags, you need to use strip_tags.