Open MgenGlder opened 1 year ago
Found the relevant PR here: https://github.com/jupyter/nbconvert/pull/1854
Interestingly we (GitHub) looked at using bleach as a second sanitization step on this project and decided against it because it caused too many notebooks to timeout and increased render times my more than 75%.
Along the same lines of python-fast-jsonschema
on nbformat
, it seems reasonable to have a hard dependency on bleach
, and nbconvert[fast]
to force the install of lxml
(or whatever ends up being the fastest, maybe some rust-based thing). This would prefer the faster library, or behind NBCONVERT_FAST=0
and/or a traitlet.
The two could then be pitted against each other during test, probably after some further homogenization.
lxml
isn't really that uncommon a dependency, and is rather widely supported (including pyodide
).
Hi, author of #1854 here. Didn't imagine I'd open such a can of worms (#1863, #1849...)... The issue is lxml
is presently a bit finicky to compile, and they don't currently e.g. ship wheels for macOS, so it felt like a good idea to switch from a binary library where only one function was being used to another library that was shipped with nbconvert
anyway...
No worries. We can't make everybody happy all of the time. Given "simple" and "fast," the former should win as a minimum-testable-product, but having a documented approach for the latter is important.... so many different use cases.
ship wheels for macOS
I guess with free software on non-free operating systems, you get what you pay for.
finicky to compile
Of note, the lxml
fields are fairly green over in conda-forge
-land, and it's likely to have even 3.11 builds for many tricky python dependencies before they are canonically available as wheels. Indeed, the migration is just up to lxml
right now... only a few thousand packages to go... but likely will reach a halting state in a reasonable amount of time. Meanwhile, thousands of upstream maintainers are unlikely to hop to building wheels for your chosen operating system. Indeed, the conda-forge migration will likely identify internally-consistent patches to overcome temporary setbacks, as its unlikely everyone has been testing against the rc
s... these patches are often PR'd upstream, and sometimes even merged.
@bollwyvl Meh, the "get what you pay for" argument is a bit naff, since lxml
used to ship wheels for macOS too up until there was a problem with wheels marked as multiarch not actually being so, so they were pulled and... https://bugs.launchpad.net/lxml/+bug/1913032
I should also note that lxml
wheels are shipped for Windows (and actually pypy on macOS too...)
There's also a bit of discussion (can't find a link right now) around whether or not libxml, libxslt and the other C friends required by lxml should be linked statically into the lxml module shipped by a PyPI wheel – this is of extra interest when there are (security) bugs in those libraries, see e.g. https://github.com/advisories/GHSA-wrxv-2j5q-m38w, and if the wheels were dynamically linked, you'd easily bump into ABI incompatibilities (I think this happened with xmlsec).
For the time being, a workaround, if you're willing to try it, is to patch lxml
back into place in the filter dict:
from nbconvert.exporters.templateexporter import default_filters
from lxml.html.clean import clean_html
default_filters["clean_html"] = clean_html
This essentially reverts the change made to the filters dict in b40bb13a44d5632f20e8bb6d13049dccb6dbfa75.
It should be noted, though, that lxml.clean_html
may be going extinct: https://bugs.launchpad.net/lxml/+bug/1958539
We could help by setting up a devcontainer for codespaces development of this repo, then folks on a mac could just use a pre-configured linux environment for this.
cc @craigpeters @cmuto09
devcontainer for codespaces
Not to be too blunt, but containers and SaaS solutions sweep these problem under the covers, and are not viable in a number of settings or for a number of users. Feel free to propose and maintain such a solution in perpetuity.
But I feel the issue at hand is not ease of development on this repo, but balancing performance and deployment simplicity on millions of devices: nbconvert
is used on disconnected RaspberryPis and exotic, airgapped big iron to show what scientists, analysts, and other non-developers, and their machines have been doing.
You know far more about how this is used than I do. I just thought it may help in some capacity.
I would imagine that Bleach is painfully slow on a Raspberry PI if it is this slow on a powerful machine 😓
painfully slow
But can be easily installed, works, and requires no porting to new platforms where python works.
As suggested, lxml
is the low-hanging fruit, and widely used (e.g. by mypy
reporting) but as noted, a blocklist-based sanitizer is less desirable for this task.
nh3
, a python binding to rust's ammonia
claim to be 15x faster than bleach
, but would need some more knobs to twiddle to work for the nbconvert
use case.
But even it that worked gangbusters, and generated identical output to bleach
, rust
doesn't even support some architectures, much less would one expect the maintainer to build binary wheels built for them... and this is going to cause trouble for more exotic parts of the jupyter stack in the coming year. But this doesn't mean it should be an option for use cases like yours.
As an update, nh3 0.2.0
is now sufficiently configurable to do what bleach is doing to HTML... except for the contents of style
attributes/tags, which can contain all kinds of evil things.
It looks like a maximally secure approach would still have to use bleach
if style
appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.
Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having lxml
around, even if nh3
was used for the cleaning process.
So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv
in place would be a big step forward.
That makes sense to me- I imagine if nh3
gives the same structure and allow-list based sanitization then it would make a great optional sanitizer to use in cases like ours where performance for web is the concern.
So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv in place would be a big step forward.
On board with this, benchmarking the uses of each library and comparing and contrasting their tradeoffs would be great for confidence and documenting the decision made for future contributors and library consumers that may want to try out different sanitizers.
Of note: bleach 6.0 out (with some breaking changes, the impact of which I haven't yet assessed), but is also now officially deprecated. We'll need to do something within the next year or so.
It looks like a maximally secure approach would still have to use
bleach
ifstyle
appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.
FWIW the underlying Ammonia library allows essentially arbitrary rewriting of attributes, though it has a fair amount of special cases which don't use this engine (e.g. classes rewriting), and no special support for @style
that I could see.
For nbconvert two options would be to either have the nh3 API expanded for this (and possibly other things, nh3 is currently quite limited), or to depend on Ammonia directly and build its own sanitizer with the additional build and distribution complexity that entails as nbviewer would now be a non-native wheel. Though this could also solve distribution issues related to nh3: distros which provide nbconvert would need to add nh3 to their packages to support an nh3-based nbconvert. I'm not sure what the policies are for build-time dependencies but e.g. ammonia is already in debian testing.
A third option would be to get lxml to use ammonia for cleaning internally, but
Description
After updating
nbconvert
from 6.5.0 to 7.2.2 for use here at GitHub we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this and it looks as though the switch tobleach
for sanitization is the culprit here. I'm wondering if we could provide a configuration for users to pass that would allow them to choose the cleaning library they would prefer to use considering it could have a major performance impact.In the meantime our team should be able to safely upgrade to 7.0.0 from 6.5.0 though moving forward we'd love to keep in close sync with the latest releases, a couple steps behind if not update to date.
I recognize this may have been a deliberate design decision, I'm interested in looking into making an PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .
Relevant PR: https://github.com/jupyter/nbconvert/pull/1854
Screenshots
7.0.0 Profile
7.0.0 Flamegraph
7.2.2 Profile 7.2.2 Flamegraph