spatialaudio / nbsphinx

:ledger: Sphinx source parser for Jupyter notebooks
https://nbsphinx.readthedocs.io/
MIT License

hyperlinks in jupyter notebook are missing when converted to html with nbsphinx+sphinx #468

Open Waerden001 opened 4 years ago

Waerden001 commented 4 years ago

I use Sphinx with nbsphinx to generate HTML files from Jupyter Notebook files. But hyperlinks in the notebook don't show up in the converted HTML file. More precisely

Does nbsphinx keep the hyperlinks in the notebook when used in sphinx as an extension?

mgeier commented 4 years ago

HTML-style hyperlinks are currently not supported by nbsphinx.

You can use one of these instead:

https://google.com

<https://google.com>

[Google](https://google.com)

[Google][1]

[1]: https://google.com

[Google]

[Google]: https://google.com
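
If a notebook already contains many HTML-style links, one option is to rewrite them into the Markdown form shown above before building. The following is only a minimal sketch and not part of nbsphinx: the function name `html_links_to_markdown` is hypothetical, and it assumes simple anchors with a quoted `href` and no nested tags in the link text. Any other attributes (such as `class`) are dropped.

```python
import re

# Matches simple anchors such as <a href="https://example.com">text</a>.
# Assumption: the href value is quoted and the link text contains no nested tags.
_ANCHOR = re.compile(
    r'<a\s+[^>]*?href=(["\'])(?P<url>.*?)\1[^>]*>(?P<text>[^<]*)</a>',
    re.IGNORECASE,
)


def html_links_to_markdown(source: str) -> str:
    """Replace simple HTML anchors with Markdown inline links."""
    return _ANCHOR.sub(
        lambda m: f"[{m.group('text')}]({m.group('url')})",
        source,
    )
```

Applied to the text of each Markdown cell, this would turn `<a href="https://google.com">Google</a>` into `[Google](https://google.com)`.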

MiniXC commented 4 years ago

Why is this the case, when regular Markdown supports html tags like <a>? This can be useful when for example wanting to use a hyperlink with a class. I noticed many other tags are stripped as well, e.g. <em>, <strong>, <article>... And writing <p>test</p> results in <p></p><p>test</p><p></p> which is unexpected. Is there a workaround for this other than using raw NBConvert html cells?

From the documentation: https://nbsphinx.readthedocs.io/en/0.4.1/markdown-cells.html#HTML-Elements-(HTML-only) and https://nbsphinx.readthedocs.io/en/0.4.1/raw-cells.html#HTML

Maybe these parts of the documentation should be clarified if certain html tags are stripped.

mgeier commented 4 years ago

Why is this the case, when regular Markdown supports html tags like <a>?

Simply because nobody has implemented it yet. And until this very issue, nobody has requested it either.

This is quite easy to implement if you simply want to convert Markdown to HTML and nothing else.

In the case of nbsphinx it is a bit more complicated, though. The Markdown content is first converted (by pandoc plus some AST manipulations) to reStructuredText which is then converted to the internal representation of Sphinx/docutils. From this internal representation, Sphinx can generate HTML and LaTeX (and EPUB, and ...) output files (involving some further custom manipulations).

Raw HTML snippets which are just passed through will be missing in the LaTeX output.

There are already two special cases implemented which also work with LaTeX output: <img> and <div class="alert alert-...">.

Theoretically, a third special case for <a> could be added.

This can be useful when for example wanting to use a hyperlink with a class.

I guess this could be implemented. Do you want to make a PR?

I noticed many other tags are stripped as well, e.g. <em>, <strong>, <article>...

I guess they get lost in the conversion from Markdown to reStructuredText.

I think they are swallowed by pandoc. I don't know if it's possible to avoid that.

In the long term, I'd like to avoid the intermediate reStructuredText representation (and the use of pandoc), see #36 (but this might still take quite a while). But then it might be easier to fix this.

And writing <p>test</p> results in <p></p><p>test</p><p></p> which is unexpected.

OK, that's strange. It's probably an artifact caused by the use of the various tools mentioned above.

Is there a workaround for this other than using raw NBConvert html cells?

You can write something like this in your Markdown cell:

<div class="my-class">

[Google](https://google.com)

</div>

The <div> tags will survive the conversion and then you should be able to use a CSS selector like .my-class a to select the link.
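
For the CSS selector to have any effect, the stylesheet has to be included in the generated HTML. A possible way to do that is sketched below as a `conf.py` fragment. `html_static_path` and `html_css_files` are regular Sphinx settings, but the file name `custom.css` and the example rule are assumptions made for illustration.

```python
# conf.py (Sphinx configuration); a sketch, not a complete configuration.

# Directory (relative to conf.py) whose contents are copied into the HTML output.
html_static_path = ["_static"]

# Stylesheets (relative to html_static_path) to include in every HTML page.
# Assumption: the file name "custom.css" is chosen for this example.
html_css_files = ["custom.css"]

# _static/custom.css could then contain a rule such as:
#     .my-class a { font-weight: bold; }
```

This only affects HTML output; the LaTeX output will not pick up the class.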

Alternatively, you could check whether MyST-NB handles this situation more to your liking.

You can also try RunNotebook (which uses a more direct Markdown-to-HTML conversion) or any of the alternatives mentioned in https://nbsphinx.readthedocs.io/en/0.7.0/links.html.

Maybe these parts of the documentation should be clarified if certain html tags are stripped.

Yes, definitely, the documentation is missing some important information here!

Would you like to make a PR to fix this?

MiniXC commented 4 years ago

I will look into pandoc and see if there are options for converting html tags, that might be the cleanest solution. Regarding the documentation: not just div seems to be supported, but audio and some others as well. If you know by any chance where these special html tags are converted to rst that would be a great help. Happy to make a PR for the docs, not sure if making an exception for just <a> tags would be worth it though.

MiniXC commented 4 years ago

Not sure if that might be out-of-scope for this issue, but my original use-case for <a> tags was that I wanted to replicate automatically linking to classes generated with autodoc as is possible in rst, e.g.:

:class:`.SomeClass`

And my specific problem was that I could not replicate the html the above line would generate in markdown. Long story short, pandoc actually extends markdown and accommodates this case with

`.SomeClass`{.interpreted-text role="class"}

This won't be nicely displayed in a notebook, but that would have been a long shot either way.

I think the documentation should more clearly say that Markdown cells are treated as pandoc markdown, I will submit a PR for that later.

Interestingly enough, any pandoc markdown that involves div does not seem to be supported by nbsphinx (maybe there is custom code for div in place?)

If I'm not mistaken, one could even add autodoc using the following:

<div class="automodule" data-members="" data-undoc-members="" data-show-inheritance="">

some_module.submodule

</div>

mgeier commented 4 years ago

Regarding the documentation: not just div seems to be supported, but audio and some others as well.

Yes, I think <audio> and <video> are the most relevant, that's why I'm showing them in https://nbsphinx.readthedocs.io/en/0.7.0/markdown-cells.html#HTML-Elements-(HTML-only).

If you know by any chance where these special html tags are converted to rst that would be a great help.

The pandoc options are here:

https://github.com/spatialaudio/nbsphinx/blob/992d55504d8cf52de62a47c9fef4fdc157434314/src/nbsphinx.py#L1362-L1366

The +raw_html setting passes some HTML tags (but apparently not all?) through.

Then there is some special handling for citations and <img> tags, but <audio> and <video> don't need special handling.

You can check pandoc's behavior like this:

$ pandoc -f markdown-native_divs -t rst
<div>bla</div>
^D
.. raw:: html

   <div>

bla

.. raw:: html

   </div>
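
The same check can be scripted when trying several tags. The following is a rough sketch that shells out to the `pandoc` executable with the `-f`/`-t` options used above. The helper names `pandoc_command` and `markdown_to_rst` are hypothetical and are not how nbsphinx itself invokes pandoc.

```python
import shutil
import subprocess


def pandoc_command(source_format="markdown-native_divs", target_format="rst"):
    """Build the pandoc command line used in the shell session above."""
    return ["pandoc", "-f", source_format, "-t", target_format]


def markdown_to_rst(text, source_format="markdown-native_divs"):
    """Convert Markdown text to reStructuredText by piping it through pandoc."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc executable not found on PATH")
    result = subprocess.run(
        pandoc_command(source_format),
        input=text,           # sent to pandoc's standard input
        capture_output=True,  # collect stdout/stderr instead of printing them
        text=True,            # work with str instead of bytes
        check=True,           # raise CalledProcessError if pandoc fails
    )
    return result.stdout
```

Calling `markdown_to_rst("<em>bla</em>\n")` or `markdown_to_rst("<a href='x'>bla</a>\n")` would then show which tags end up in a `raw` directive or role and which get lost.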

Note that for (future) CommonMark compatibility, blank lines should be used inside the <div> tags:

$ pandoc -f commonmark -t rst
<div>bla</div>
^D
.. raw:: html

   <div>bla</div>

vs.

$ pandoc -f commonmark -t rst
<div>

bla

</div>
^D
.. raw:: html

   <div>

bla

.. raw:: html

   </div>

Happy to make a PR for the docs,

That would be great!

not sure if making an exception for just <a> tags would be worth it though.

I don't know, probably not.

Not sure if that might be out-of-scope for this issue, but my original use-case for <a> tags was that I wanted to replicate automatically linking to classes generated with autodoc as is possible in rst, e.g.:

:class:`.SomeClass`

My work-around for autodoc links is https://nbsphinx.readthedocs.io/en/0.7.0/markdown-cells.html#Links-to-Domain-Objects.

This is of course not as simple as :class:`SomeClass`, but the advantage is that the links also look somewhat reasonable in JupyterLab/nbviewer/GitHub.

I think the documentation should more clearly say that Markdown cells are treated as pandoc markdown, I will submit a PR for that later.

I would prefer not mentioning pandoc, because it is just an implementation detail which will be removed in the (rather far) future.

I think it would be better to mention a few tags that work (e.g. <div>, <audio>) and vaguely mention that not all tags work.

This way we are open for future changes in behavior.

Interestingly enough, any pandoc markdown that involves div does not seem to be supported by nbsphinx (maybe there is custom code for div in place?)

nbsphinx uses the -native_divs option, maybe that's the culprit?

The raw <div> tags are parsed in the ReplaceAlertDivs transform, in order to find "alert" divs which are turned into "notes"/"warnings".

But all other <div> elements should be passed through?

If I'm not mistaken, one could even add autodoc using the following [...]

You mean instead of using the automodule directive?

Why not just use a raw reST cell (or a separate reST source file) for that?

Waerden001 commented 4 years ago

Regarding the documentation: not just div seems to be supported, but audio and some others as well.

Yes, I think <audio> and <video> are the most relevant, that's why I'm showing them in https://nbsphinx.readthedocs.io/en/0.7.0/markdown-cells.html#HTML-Elements-(HTML-only).

If you know by any chance where these special html tags are converted to rst that would be a great help.

The pandoc options are here:

https://github.com/spatialaudio/nbsphinx/blob/992d55504d8cf52de62a47c9fef4fdc157434314/src/nbsphinx.py#L1362-L1366

The +raw_html setting passes some HTML tags (but apparently not all?) through.

My Markdown cells are usually just a mixture of plain text, HTML tags, images and LaTeX code. nbsphinx + Sphinx handle everything smoothly except those small HTML tags. Is it possible to handle more HTML tags like <a> by just modifying the +raw_html settings a little bit?

mgeier commented 4 years ago

I don't know. Probably. How would you modify them?

MiniXC commented 4 years ago

My work-around for autodoc links is https://nbsphinx.readthedocs.io/en/0.7.0/markdown-cells.html#Links-to-Domain-Objects.

I saw that workaround, unfortunately it does not replicate the styling that is applied when linking to domain objects in sphinx. Functionally it does the same though, so it is a solution.

mgeier commented 4 years ago

I saw that workaround, unfortunately it does not replicate the styling that is applied when linking to domain objects in sphinx.

Yeah, I know, the problem is that reST doesn't allow nested markup, see #301. This will hopefully become possible when #36 is solved, but this might take some more time ...