Open frenzymadness opened 1 year ago
Thank you for the notification!
I think in my case this is not relevant, because I use clean_html to "prepare html" before uploading to telegra.ph, which obviously uses its own sanitizer.
However, I tried nh3 and it fits almost perfectly. It broke just 2 of my tests. For some reason nh3 (more likely ammonia) removes the leading line break inside a pre tag: <pre>\n abc</pre> becomes <pre> abc</pre>
Not a big deal, but I think this is wrong behavior.
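If the dropped line break turns out to be acceptable, one possible workaround for the failing tests is to normalize both the expected and the actual HTML before comparing them. The helper below is only a sketch using the standard library; the name normalize_pre_newline is hypothetical and is not part of nh3 or lxml.

```python
import re

# Matches an opening <pre> tag (with optional attributes) followed by one line break.
_PRE_LEADING_NEWLINE = re.compile(r"(<pre\b[^>]*>)\r?\n", flags=re.IGNORECASE)


def normalize_pre_newline(html: str) -> str:
    """Drop a single line break that directly follows an opening <pre> tag.

    Hypothetical test helper: apply it to both sides of a comparison so the
    test no longer depends on whether the sanitizer keeps that line break.
    """
    return _PRE_LEADING_NEWLINE.sub(r"\1", html)


before = "<pre>\n abc</pre>"
after = "<pre> abc</pre>"
same_after_normalizing = normalize_pre_newline(before) == normalize_pre_newline(after)
```

This only touches the first line break after the opening tag; any further line breaks inside the pre block are left alone.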
Just an update on this. The latest version of lxml (5.2.0) no longer contains the HTML cleaner. Its code is now available as a dedicated project on GitHub and PyPI.
If you want to continue using it, you can depend either on lxml[html_clean] or on lxml_html_clean directly. lxml contains a backward-compatible import, so there is nothing you need to change other than the dependency.
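For code that has to run in environments where the dependency may not have been updated yet, a defensive import can make the situation explicit. This is a minimal sketch, assuming the standalone package exposes clean_html under the module name lxml_html_clean and that lxml.html.clean remains importable as the backward-compatible path; the helper name load_clean_html is hypothetical.

```python
import importlib


def load_clean_html():
    """Return a clean_html callable, or None if no cleaner is installed.

    Tries the standalone package first, then the backward-compatible
    lxml.html.clean import path. Both module names are taken from the
    comment above; the fallback order is an assumption of this sketch.
    """
    for module_name in ("lxml_html_clean", "lxml.html.clean"):
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            # Module (or the package behind it) is not installed; try the next one.
            continue
        return getattr(module, "clean_html", None)
    return None


clean_html = load_clean_html()
if clean_html is None:
    print("No HTML cleaner found; add lxml[html_clean] or lxml_html_clean as a dependency.")
```

With the dependency declared as described above, the plain import from lxml.html.clean keeps working and a helper like this is not needed.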
I'd like to bring to your attention that we are discussing the possibility of removing the clean_html functionality from the lxml library. Over the past years, several concerning security vulnerabilities have been discovered in lxml's clean_html functionality: CVE-2021-43818, CVE-2021-28957, CVE-2020-27783, CVE-2018-19787 and CVE-2014-3146.
The main problem is in the design. Because lxml's clean_html functionality is based on a blocklist, it's hard to keep it up to date with all the new possibilities in HTML and JS.
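The difference between the two designs can be shown with a toy example. This is only an illustration of the idea, not a sanitizer and not how lxml, bleach or nh3 are implemented; the tag sets and function names are made up for the sketch.

```python
# Illustrative tag sets only; real sanitizers track far more than tag names.
BLOCKED_TAGS = {"script", "iframe", "object"}
ALLOWED_TAGS = {"p", "b", "i", "a", "pre"}


def blocklist_permits(tag: str) -> bool:
    """Blocklist design: everything passes unless it is known to be dangerous."""
    return tag.lower() not in BLOCKED_TAGS


def allowlist_permits(tag: str) -> bool:
    """Allowlist design: nothing passes unless it is known to be safe."""
    return tag.lower() in ALLOWED_TAGS


# A made-up element standing in for a feature added to HTML after the lists were written.
new_tag = "newtag"
print(blocklist_permits(new_tag))  # the blocklist lets the unknown tag through
print(allowlist_permits(new_tag))  # the allowlist rejects it by default
```

A blocklist has to be updated for every new feature to stay safe, while an allowlist fails closed when something unknown appears.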
Two viable alternatives worth considering are bleach and nh3. Here's why:
bleach:
nh3:
We'll probably move the cleaning part of lxml to a distinct project first, so it will still be possible to use it, but it is better to find a suitable alternative sooner rather than later.
Let me know if we can help you with this transition in any way, and have a nice day.