rushter / selectolax

Python binding to Modest and Lexbor engines (fast HTML5 parser with CSS selectors).
MIT License

Allow regular expressions in `text_contains` / `any_text_contains` #92

Open kelvindecosta opened 1 year ago

kelvindecosta commented 1 year ago

I would like to find all script tags within a page which contain certain keywords.

Here's how it's done via bs4:

script_tags = soup.find_all("script", string=re.compile(r"window\.(keyword|another_keyword)"))

I'm not too sure how to go about it with selectolax.

I could use the matches provided by .select(css).text_contains(pattern) like so:

first_keyword_scripts = tree.select("script").text_contains("window.keyword").matches

However, if I were to do the same for the second keyword, it would be difficult to recreate the behavior of the bs4 version, in which:

Another use case for regular expressions is the ability to ignore keywords. It would be nice to have a text_does_not_contain function too.

I think these issues can be solved with regular expressions but I'm probably wrong.

I'd appreciate any feedback on how these examples can be run via selectolax. For now I have a workaround that calls .text() on each node and matches the result in Python, which isn't really great for large script texts.
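For reference, the .text()-based workaround might look roughly like the sketch below. The helper names `filter_by_regex` and `find_scripts` are made up for illustration and are not part of selectolax; the only selectolax calls used are `HTMLParser`, `.css()` and `.text()`, and the sketch assumes `.text()` returns the inline source of a script tag. The `invert` flag stands in for the `text_does_not_contain` idea.

```python
import re
from typing import Any, Iterable, List, Pattern


def filter_by_regex(nodes: Iterable[Any], pattern: Pattern[str],
                    invert: bool = False) -> List[Any]:
    """Keep nodes whose .text() matches `pattern` (or does not, if invert=True).

    Hypothetical helper: works on any object exposing a .text() method.
    """
    kept = []
    for node in nodes:
        matched = pattern.search(node.text()) is not None
        if matched != invert:
            kept.append(node)
    return kept


def find_scripts(html: str, pattern: Pattern[str]) -> List[Any]:
    """Return the <script> nodes of `html` whose text matches `pattern`."""
    # Imported lazily so the helper above stays usable without selectolax.
    from selectolax.parser import HTMLParser

    tree = HTMLParser(html)
    return filter_by_regex(tree.css("script"), pattern)


# Same pattern as in the bs4 example.
KEYWORDS = re.compile(r"window\.(keyword|another_keyword)")
```

Calling `find_scripts(html, KEYWORDS)` would then cover both keywords in one pass, at the cost of copying every script body into a Python string first.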

Thanks for your time and for maintaining this project!

rushter commented 1 year ago

I think I can implement a callback interface so that you can provide a custom function that executes the regex internally.
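To make the idea concrete, here is one way a callback-style filter could look from the user's side. This is only a sketch: `FilteredSelection` and `where` are hypothetical names, not selectolax's API and not necessarily the design that would be implemented. The caller supplies any predicate, so regex matching and "does not contain" checks both fall out of the same mechanism.

```python
import re
from typing import Any, Callable, Iterable, List


class FilteredSelection:
    """Hypothetical chainable wrapper around a list of matched items."""

    def __init__(self, items: Iterable[Any]) -> None:
        self.matches: List[Any] = list(items)

    def where(self, predicate: Callable[[Any], bool]) -> "FilteredSelection":
        """Return a new selection holding only items the callback accepts."""
        return FilteredSelection(m for m in self.matches if predicate(m))


pattern = re.compile(r"window\.(keyword|another_keyword)")

# Plain strings stand in for nodes here so the sketch runs on its own.
# With selectolax, the items could be tree.css("script") and the
# callbacks would call node.text() before matching.
scripts = FilteredSelection([
    "window.keyword = 1",
    "window.keyword = 'debug'",
    "var unrelated = true",
])

selected = (
    scripts
    .where(lambda s: pattern.search(s) is not None)  # regex match
    .where(lambda s: "debug" not in s)               # exclusion
    .matches
)
```

Because each `where` call returns a new selection, filters can be chained without mutating the original list.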

kelvindecosta commented 1 year ago

@rushter, thank you so much for considering this issue!