Closed by frenzymadness 2 months ago
Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.
If you want to continue using it, you can depend either on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so the dependency is the only thing you need to change.
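To check which of the two import paths is available in a given environment, something like the following sketch could be used. The helper name load_cleaner_class is hypothetical, and the sketch assumes that both lxml_html_clean and the backward-compatible lxml.html.clean module expose a Cleaner class.

```python
import importlib


def load_cleaner_class():
    """Return the Cleaner class from whichever package provides it, or None.

    Hypothetical helper for illustration only. It tries the dedicated
    project first, then the backward-compatible lxml import path.
    """
    for module_name in ("lxml_html_clean", "lxml.html.clean"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Package not installed, or lxml could not find the cleaner.
            continue
        return getattr(module, "Cleaner", None)
    return None


cleaner_class = load_cleaner_class()
if cleaner_class is None:
    print("HTML cleaner not available; add lxml[html_clean] or lxml_html_clean")
else:
    print("HTML cleaner found:", cleaner_class.__module__)
```

If the helper returns None, adding one of the two dependencies mentioned above should be enough; existing imports can stay as they are.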
Sounds like what I would do if I were creating a pip package with an exploit.
I'm not sure what gives you that impression. We implemented a backward-incompatible change in lxml and gave you a lot of time to think about the best approach to your project.
If you do nothing and install the latest lxml, this line will raise an exception: https://github.com/kootenpv/sky/blob/e4abc5d14db01f54dcf5d974355fa2f5fdd395dc/sky/helper.py#L9 because the cleaner is no longer a part of the original lxml project.
You can check the dedicated project here: https://github.com/fedora-python/lxml_html_clean
I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in this functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.
The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all the new possibilities in HTML and JS.
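The design problem can be illustrated with a toy sketch that uses only the standard library. It is not how lxml's cleaner is implemented; the attribute sets and the surviving_attrs helper are made up for the illustration. A blocklist lets through any event handler it has not heard of, while an allowlist drops everything it does not explicitly know.

```python
from html.parser import HTMLParser

# Both sets are hypothetical and deliberately tiny.
BLOCKED_ATTRS = {"onclick", "onload", "onerror"}  # blocklist: known-bad names
ALLOWED_ATTRS = {"href", "title", "alt"}          # allowlist: known-good names


class AttrCollector(HTMLParser):
    """Collect every attribute name seen on a start tag."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.extend(name for name, _ in attrs)


def surviving_attrs(html, keep):
    """Return the attribute names for which the keep() predicate is true."""
    collector = AttrCollector()
    collector.feed(html)
    collector.close()
    return [name for name in collector.attrs if keep(name)]


# onpointerenter is an event handler the blocklist above does not know about.
sample = '<a href="/x" onclick="a()" onpointerenter="b()">link</a>'

blocklist_result = surviving_attrs(sample, lambda n: n not in BLOCKED_ATTRS)
allowlist_result = surviving_attrs(sample, lambda n: n in ALLOWED_ATTRS)

print(blocklist_result)  # ['href', 'onpointerenter']
print(allowlist_result)  # ['href']
```

The blocklist has to be extended every time the platform gains a new handler, which is the maintenance burden described above.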
Two viable alternatives worth considering are bleach and nh3. Here's why:
bleach:
nh3:
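As a rough sketch of what switching could look like, both libraries offer a clean() function that accepts an HTML string and sanitizes it with their default settings. The sanitize wrapper below is hypothetical, and the order in which it tries the two libraries is arbitrary. The defaults of each library differ and should be reviewed before relying on them.

```python
def sanitize(html):
    """Sanitize HTML with nh3 if installed, otherwise with bleach.

    Hypothetical wrapper for illustration; it relies on each library's
    default allowlist and raises RuntimeError if neither is installed.
    """
    try:
        import nh3
    except ImportError:
        nh3 = None
    if nh3 is not None:
        return nh3.clean(html)

    try:
        import bleach
    except ImportError:
        bleach = None
    if bleach is not None:
        return bleach.clean(html)

    raise RuntimeError("Neither nh3 nor bleach is installed")


if __name__ == "__main__":
    try:
        print(sanitize("<b>hello</b><script>alert(1)</script>"))
    except RuntimeError as exc:
        print(exc)
```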
We'll probably move the cleaning part of lxml to a separate project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.
Let me know if we can help you with this transition in any way, and have a nice day.