kootenpv / sky

:sunrise: next generation web crawling using machine intelligence
BSD 3-Clause "New" or "Revised" License

Consider switching from lxml's clean_html for enhanced security (and possibly performance) #18

Closed: frenzymadness closed this issue 2 months ago

frenzymadness commented 1 year ago

I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in lxml's clean_html functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.

The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with every new possibility in HTML and JS: anything the blocklist does not yet know about passes through untouched.
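To illustrate the design point, here is a toy sketch of the difference between a blocklist and an allowlist. It is not how lxml, bleach or nh3 work internally; the tag sets, the `TagCollector` class and the `surviving_tags` helper are all made up for this example, and it only looks at tag names rather than sanitizing anything.

```python
from html.parser import HTMLParser


class TagCollector(HTMLParser):
    """Collects the names of all start tags seen in a document."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


# Hypothetical policies, chosen only for illustration.
BLOCKED_TAGS = {"script", "iframe"}
ALLOWED_TAGS = {"p", "b", "a"}


def surviving_tags(html, keep):
    """Return the tag names for which the policy function `keep` is true."""
    collector = TagCollector()
    collector.feed(html)
    collector.close()
    return [tag for tag in collector.tags if keep(tag)]


# "newtag" stands in for any element the policy author never heard of.
sample = "<p>hi</p><script>x()</script><newtag>y</newtag>"

blocklist_result = surviving_tags(sample, lambda tag: tag not in BLOCKED_TAGS)
allowlist_result = surviving_tags(sample, lambda tag: tag in ALLOWED_TAGS)

print(blocklist_result)  # the unknown tag slips through
print(allowlist_result)  # the unknown tag is dropped by default
```

With a blocklist, the unknown element survives until someone adds it to the list; with an allowlist, it is rejected unless someone explicitly permits it.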

Two viable alternatives worth considering are bleach and nh3. Here's why:

bleach:

nh3:

We'll probably move the cleaning part of lxml to a distinct project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.

Let me know if we can help you with this transition in any way, and have a nice day.

frenzymadness commented 7 months ago

Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.

If you want to continue using it, you can depend either on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so the dependency is the only thing you need to change.
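For a project that wants to cope with either setup, a defensive lookup could look roughly like the sketch below. The helper name `load_cleaner_class` is made up for this example, and it assumes both modules expose a `Cleaner` class, as the original lxml.html.clean did; if neither package is installed it returns None instead of raising.

```python
import importlib


def load_cleaner_class():
    """Return the Cleaner class from whichever module provides it, else None."""
    # Try the dedicated project first, then the backward-compatible lxml path.
    for module_name in ("lxml_html_clean", "lxml.html.clean"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Package missing, or lxml is installed without the cleaner.
            continue
        cleaner = getattr(module, "Cleaner", None)
        if cleaner is not None:
            return cleaner
    return None


Cleaner = load_cleaner_class()
if Cleaner is None:
    print("HTML cleaner not available; install lxml_html_clean")
else:
    print("HTML cleaner found:", Cleaner.__module__)
```

Since the backward-compatible import already exists in lxml, changing only the dependency is enough in most cases; a fallback like this is mainly useful when the code must give a clear message on installations where the cleaner is missing.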

kootenpv commented 2 months ago

Sounds like what I would do if I were creating a pip package with an exploit

frenzymadness commented 2 months ago

I'm not sure what gives you that impression. We implemented a backward-incompatible change in lxml and gave you a lot of time to think about the best approach to your project.

If you do nothing and install the latest lxml, this line will raise an exception: https://github.com/kootenpv/sky/blob/e4abc5d14db01f54dcf5d974355fa2f5fdd395dc/sky/helper.py#L9 because the cleaner is no longer a part of the original lxml project.

You can check the dedicated project here: https://github.com/fedora-python/lxml_html_clean