jupyter / nbconvert

Jupyter Notebook Conversion
https://nbconvert.readthedocs.io/
BSD 3-Clause "New" or "Revised" License
1.72k stars 563 forks source link

Bleach seems to be significantly slower than lxml in 7.1.x+ #1892

Open MgenGlder opened 1 year ago

MgenGlder commented 1 year ago

Description

After updating nbconvert from 6.5.0 to 7.2.2 for use here at GitHub we discovered the notebook rendering service degraded in render time (almost twice as long to render). We looked into what might have caused this and it looks as though the switch to bleach for sanitization is the culprit here. I'm wondering if we could provide a configuration for users to pass that would allow them to choose the cleaning library they would prefer to use considering it could have a major performance impact.

In the meantime our team should be able to safely upgrade to 7.0.0 from 6.5.0 though moving forward we'd love to keep in close sync with the latest releases, a couple steps behind if not update to date.

I recognize this may have been a deliberate design decision, I'm interested in looking into making an PR myself. Any pointers/recommendations would be appreciated 🙇🏾 .

Relevant PR: https://github.com/jupyter/nbconvert/pull/1854

Screenshots

7.0.0 Profile Screen Shot 2022-10-27 at 2 32 22 PM

7.0.0 Flamegraph Screen Shot 2022-10-27 at 2 31 18 PM

7.2.2 Profile Screen Shot 2022-10-27 at 12 26 31 PM 7.2.2 Flamegraph Screen Shot 2022-10-27 at 12 26 14 PM

MgenGlder commented 1 year ago

Found the relevant PR here: https://github.com/jupyter/nbconvert/pull/1854

gwincr11 commented 1 year ago

Interestingly we (GitHub) looked at using bleach as a second sanitization step on this project and decided against it because it caused too many notebooks to timeout and increased render times my more than 75%.

bollwyvl commented 1 year ago

Along the same lines of python-fast-jsonschema on nbformat, it seems reasonable to have a hard dependency on bleach, and nbconvert[fast] to force the install of lxml (or whatever ends up being the fastest, maybe some rust-based thing). This would prefer the faster library, or behind NBCONVERT_FAST=0 and/or a traitlet.

The two could then be pitted against each other during test, probably after some further homogenization.

lxml isn't really that uncommon a dependency, and is rather widely supported (including pyodide).

akx commented 1 year ago

Hi, author of #1854 here. Didn't imagine I'd open such a can of worms (#1863, #1849...)... The issue is lxml is presently a bit finicky to compile, and they don't currently e.g. ship wheels for macOS, so it felt like a good idea to switch from a binary library where only one function was being used to another library that was shipped with nbconvert anyway...

bollwyvl commented 1 year ago

No worries. We can't make everybody happy all of the time. Given "simple" and "fast," the former should win as a minimum-testable-product, but having a documented approach for the latter is important.... so many different use cases.

ship wheels for macOS

I guess with free software on non-free operating systems, you get what you pay for.

finicky to compile

Of note, the lxml fields are fairly green over in conda-forge-land, and it's likely to have even 3.11 builds for many tricky python dependencies before they are canonically available as wheels. Indeed, the migration is just up to lxml right now... only a few thousand packages to go... but likely will reach a halting state in a reasonable amount of time. Meanwhile, thousands of upstream maintainers are unlikely to hop to building wheels for your chosen operating system. Indeed, the conda-forge migration will likely identify internally-consistent patches to overcome temporary setbacks, as its unlikely everyone has been testing against the rcs... these patches are often PR'd upstream, and sometimes even merged.

akx commented 1 year ago

@bollwyvl Meh, the "get what you pay for" argument is a bit naff, since lxml used to ship wheels for macOS too up until there was a problem with wheels marked as multiarch not actually being so, so they were pulled and... https://bugs.launchpad.net/lxml/+bug/1913032 I should also note that lxml wheels are shipped for Windows (and actually pypy on macOS too...)

There's also a bit of discussion (can't find a link right now) around whether or not libxml, libxslt and the other C friends required by lxml should be linked statically into the lxml module shipped by a PyPI wheel – this is of extra interest when there are (security) bugs in those libraries, see e.g. https://github.com/advisories/GHSA-wrxv-2j5q-m38w, and if the wheels were dynamically linked, you'd easily bump into ABI incompatibilities (I think this happened with xmlsec).

akx commented 1 year ago

For the time being, a workaround, if you're willing to try it, is to patch lxml back into place in the filter dict:

from nbconvert.exporters.templateexporter import default_filters
from lxml.html.clean import clean_html
default_filters["clean_html"] = clean_html

This essentially reverts the change made to the filters dict in b40bb13a44d5632f20e8bb6d13049dccb6dbfa75.

It should be noted, though, that lxml.clean_html may be going extinct: https://bugs.launchpad.net/lxml/+bug/1958539

gwincr11 commented 1 year ago

We could help by setting up a devcontainer for codespaces development of this repo, then folks on a mac could just use a pre-configured linux environment for this.

cc @craigpeters @cmuto09

bollwyvl commented 1 year ago

devcontainer for codespaces

Not to be too blunt, but containers and SaaS solutions sweep these problem under the covers, and are not viable in a number of settings or for a number of users. Feel free to propose and maintain such a solution in perpetuity.

But I feel the issue at hand is not ease of development on this repo, but balancing performance and deployment simplicity on millions of devices: nbconvert is used on disconnected RaspberryPis and exotic, airgapped big iron to show what scientists, analysts, and other non-developers, and their machines have been doing.

gwincr11 commented 1 year ago

You know far more about how this is used than I do. I just thought it may help in some capacity.

gwincr11 commented 1 year ago

I would imagine that Bleach is painfully slow on a Raspberry PI if it is this slow on a powerful machine 😓

bollwyvl commented 1 year ago

painfully slow

But can be easily installed, works, and requires no porting to new platforms where python works.

As suggested, lxml is the low-hanging fruit, and widely used (e.g. by mypy reporting) but as noted, a blocklist-based sanitizer is less desirable for this task.

nh3, a python binding to rust's ammonia claim to be 15x faster than bleach, but would need some more knobs to twiddle to work for the nbconvert use case.

But even it that worked gangbusters, and generated identical output to bleach, rust doesn't even support some architectures, much less would one expect the maintainer to build binary wheels built for them... and this is going to cause trouble for more exotic parts of the jupyter stack in the coming year. But this doesn't mean it should be an option for use cases like yours.

bollwyvl commented 1 year ago

As an update, nh3 0.2.0 is now sufficiently configurable to do what bleach is doing to HTML... except for the contents of style attributes/tags, which can contain all kinds of evil things.

It looks like a maximally secure approach would still have to use bleach if style appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.

Further, as the stdlib XML parser is used in a number of parsing-related places, one might still want to prefer having lxml around, even if nh3 was used for the cleaning process.

So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv in place would be a big step forward.

MgenGlder commented 1 year ago

That makes sense to me- I imagine if nh3 gives the same structure and allow-list based sanitization then it would make a great optional sanitizer to use in cases like ours where performance for web is the concern.

So there's certainly an opportunity to garner a number of performance gains, but before doing so, having an actual benchmark suite with e.g. asv in place would be a big step forward.

On board with this, benchmarking the uses of each library and comparing and contrasting their tradeoffs would be great for confidence and documenting the decision made for future contributors and library consumers that may want to try out different sanitizers.

bollwyvl commented 1 year ago

Of note: bleach 6.0 out (with some breaking changes, the impact of which I haven't yet assessed), but is also now officially deprecated. We'll need to do something within the next year or so.

xmo-odoo commented 1 year ago

It looks like a maximally secure approach would still have to use bleach if style appeared anywhere, but could use a faster, allowlist-compatible sanitizer, and generate the same XML DOM.

FWIW the underlying Ammonia library allows essentially arbitrary rewriting of attributes, though it has a fair amount of special cases which don't use this engine (e.g. classes rewriting), and no special support for @style that I could see.

For nbconvert two options would be to either have the nh3 API expanded for this (and possibly other things, nh3 is currently quite limited), or to depend on Ammonia directly and build its own sanitizer with the additional build and distribution complexity that entails as nbviewer would now be a non-native wheel. Though this could also solve distribution issues related to nh3: distros which provide nbconvert would need to add nh3 to their packages to support an nh3-based nbconvert. I'm not sure what the policies are for build-time dependencies but e.g. ammonia is already in debian testing.

A third option would be to get lxml to use ammonia for cleaning internally, but