spatialaudio / nbsphinx

:ledger: Sphinx source parser for Jupyter notebooks
https://nbsphinx.readthedocs.io/
MIT License

Avoid sphinx searching on output cells #777

Closed pfebrer closed 7 months ago

pfebrer commented 7 months ago

In our documentation, we have some notebooks rendered by nbsphinx that include Plotly plots. The output cells of these notebooks include the full Plotly JavaScript library. When we use Sphinx's search bar in our documentation, we get hits for Plotly's JavaScript (circled in red):

[Screenshot from 2024-02-22 16-08-58]

Is there any way to avoid this?

mgeier commented 7 months ago

Can you please show the HTML of the affected page?

It looks like HTML tags are supposed to get stripped, see https://github.com/sphinx-doc/sphinx/commit/53ea1cb2808e90b51f0ed9468740a34c00decc2a

It would be ideal if you could reproduce this with the raw directive (without using nbsphinx), then you could raise this as a Sphinx issue.

pfebrer commented 7 months ago

Ok, so what I understand from the code there is that in principle everything inside a script tag is ignored, right?

Thanks, I'll try to dig deeper!

pfebrer commented 7 months ago

This is the page: https://zerothi.github.io/sisl/visualization/viz_module/showcase/GeometryPlot.html#GeometryPlot

(I can't upload HTML files to GitHub.)

And if I grep on that html file:

grep -n "not have a valid GeoJSON geometry" geometry_plot.html | cut -d : -f 1

I get a match on line 208, which is where the plotly library is included inside a script tag.

mgeier commented 7 months ago

Thanks for the link!

BTW, the "download ipynb" link is broken: https://raw.githubusercontent.com/zerothi/sisl/main//home/runner/work/sisl/sisl/docs/visualization/viz_module/showcase/GeometryPlot.ipynb

I guess it is meant to be this: https://raw.githubusercontent.com/zerothi/sisl/main/docs/visualization/viz_module/showcase/GeometryPlot.ipynb

However, this doesn't contain the outputs. Can you please provide the .ipynb file with outputs?

pfebrer commented 7 months ago

Yes, I'll send it to you as soon as I get home 👍

Thanks for the broken link report!

pfebrer commented 7 months ago

Here it is: GeometryPlot.zip

mgeier commented 7 months ago

Thanks for the notebook file!

Playing around with that, I could reduce this to a pure Sphinx problem: https://github.com/sphinx-doc/sphinx/issues/12052

It looks like the content of the <script> tag is indeed ignored when building the search index, but it is not ignored when generating the search preview.

Note that in your example the Plotly stuff is only shown because the word "geometry" is also used somewhere else on the page. If you search for "GeoJSON", you'll find nothing, even though the word is right next to "geometry".
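To illustrate the mismatch described above, here is a minimal Python sketch. It is not Sphinx's actual code: `index_words`, `naive_preview`, and the sample `PAGE` are hypothetical names made up for this example. The assumption is that the indexer skips `<script>` content while the preview only strips the tags themselves.

```python
import re
from html.parser import HTMLParser


class VisibleTextExtractor(HTMLParser):
    """Collect text nodes, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def index_words(html):
    """Hypothetical indexer: only words outside <script>/<style> are indexed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return set(re.findall(r"\w+", " ".join(parser.chunks).lower()))


def naive_preview(html):
    """Hypothetical preview: removes the tags but keeps the script body."""
    return re.sub(r"<[^>]*>", " ", html)


# Made-up page: "geometry" appears in the prose and in the script body,
# "GeoJSON" appears only in the script body.
PAGE = (
    "<p>Plotting a geometry with sisl.</p>"
    "<script>var msg = 'does not have a valid GeoJSON geometry';</script>"
)

print("geometry" in index_words(PAGE))  # page is found via the prose
print("geojson" in index_words(PAGE))   # script-only word is not indexed
print("GeoJSON" in naive_preview(PAGE))  # but the preview still shows it
```

With this sketch, a search for "geometry" matches the page because of the prose, and the preview then leaks the script text; a search for "GeoJSON" matches nothing.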

pfebrer commented 7 months ago

Thank you very much! That's an interesting bug 😅

I guess I can close this then 👍

mgeier commented 7 months ago

> That's an interesting bug

Yes indeed!

It is a dangerous pattern to look out for: there is one piece of data (in our case the HTML source text) and there are two sub-systems handling that data separately (in our case the search index generation and the search preview generation). Those two systems are supposed to have the same behavior, but if they don't, we have a problem.

This reminds me of a vulnerability in the librsvg library that I read about recently: https://nvd.nist.gov/vuln/detail/CVE-2023-38633. In that case, the common piece of data was a URL, which was rejected by one sub-system but not by another, which resulted in a potential exploit.
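One common way to guard against the "one piece of data, two sub-systems" pattern is to route both sub-systems through a single shared function, so they cannot drift apart. The following is only a sketch of that idea in Python, not how Sphinx or librsvg handle it; `visible_text`, `build_index`, `build_preview`, and `PAGE` are hypothetical names, and the regexes are a simplification rather than a robust HTML parser.

```python
import re

# Simplified patterns for illustration only; real HTML needs a real parser.
_SCRIPT_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
_TAG_RE = re.compile(r"<[^>]*>")


def visible_text(html):
    """Single source of truth for what counts as searchable text."""
    without_scripts = _SCRIPT_RE.sub(" ", html)
    text = _TAG_RE.sub(" ", without_scripts)
    return " ".join(text.split())


def build_index(html):
    """Index generation uses the shared function."""
    return set(re.findall(r"\w+", visible_text(html).lower()))


def build_preview(html, term, width=40):
    """Preview generation uses the same shared function."""
    text = visible_text(html)
    pos = text.lower().find(term.lower())
    if pos < 0:
        return ""
    start = max(0, pos - width)
    return text[start:pos + len(term) + width]


# Made-up page for demonstration.
PAGE = (
    "<p>Plotting a geometry with sisl.</p>"
    "<script>var msg = 'does not have a valid GeoJSON geometry';</script>"
)

print(build_preview(PAGE, "geometry"))
print(repr(build_preview(PAGE, "GeoJSON")))
```

Because both functions call `visible_text`, anything excluded from the index is also excluded from the preview, and a later change to the stripping rules applies to both at once.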

zerothi commented 6 months ago

Love this! Thanks!