spatialaudio / nbsphinx

:ledger: Sphinx source parser for Jupyter notebooks
https://nbsphinx.readthedocs.io/
MIT License

raw html (div) in markdown does not render in PDF #513

Open actsasgeek opened 4 years ago

actsasgeek commented 4 years ago

First, I love nbsphinx. Unlike many (most?), however, my target is PDFs. I find my students prefer PDF files over Notebooks or HTML for a variety of reasons but mostly because they can read them on their devices when they don't have internet access and they do, sometimes, kill a tree printing them out (as do I).

Second, I'm having a bit of trouble with raw HTML working in the translation from HTML to PDF (and I'm not sure exactly which part isn't working). When I have a div with a colored background like so:

[Screenshot: div with a colored background, as rendered in the notebook]

the result in the PDF is plain:

[Screenshot: the same content in the PDF, rendered as plain text]

I'm not completely clear on the entire infrastructure so I'm not sure where the problem may be. Based on what I could find, Sphinx (CommonMark) should support this (although to what extent, I'm not sure). Pandoc does as well (but only with markdown_strict, not the defaults).

Perhaps there's an easy fix/setting I'm missing?

mgeier commented 4 years ago

> however, my target is PDFs

PDF support is also important for me.

Currently, divs are simply ignored for PDF output (like most other HTML tags), with one notable exception: https://nbsphinx.readthedocs.io/en/0.8.0/markdown-cells.html#Info/Warning-Boxes

Here's the PDF version: https://nbsphinx.readthedocs.io/_/downloads/en/0.8.0/pdf/#subsection.3.8

So if you use exactly <div class="alert alert-info"> or <div class="alert alert-warning"> you should get a colored box in HTML and a somewhat framed box in PDF.

The border and colors of the frames in the PDF can be specified like this:

https://github.com/spatialaudio/nbsphinx/blob/98005a9d6b331b7d6d14221539154df69f7ae51a/doc/conf.py#L135-L139
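For readers who cannot follow the link, here is a minimal sketch of what such a setting could look like in `conf.py`. It assumes that the alert boxes end up as Sphinx "note" and "warning" admonitions in the LaTeX output and that the frames are styled through the `sphinxsetup` key of Sphinx's `latex_elements` option. The color and width values are illustrative and are not claimed to match the linked lines.

```python
# conf.py (excerpt): a sketch, not the exact contents of the linked file.
latex_elements = {
    # 'sphinxsetup' passes key=value options to Sphinx's LaTeX style.
    # Assumption: alert-info boxes are rendered as "note" admonitions and
    # alert-warning boxes as "warning" admonitions in the PDF.
    "sphinxsetup": r"""
        noteBorderColor={HTML}{E0E0E0},
        noteborder=1.5pt,
        warningBorderColor={HTML}{E0E0E0},
        warningborder=1.5pt,
        warningBgColor={HTML}{FBFBFB},
    """,
}
```

These keys only affect the LaTeX/PDF builder; the HTML appearance is still controlled by CSS.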

Is that sufficient for your needs? Do you have any ideas for improving this?

If you want to change the default colors or border widths, please make a PR.

actsasgeek commented 4 years ago

Matthias,

Thank you for your reply.

I generally use three levels (green, yellow, red...a cliché, I know, and also not very colorblind friendly). For now, I could probably just use some sort of neutral color and embed an appropriate image (would that work OK?). There would still be some desire for formatting, though (I'm not a purist...a table?).

Is there a fundamental issue with supporting embedded HTML? I would have thought that was "out of the box" delegated to whatever translator was used. I just don't know enough about the inner workings to weigh in on a solution but I'd be willing to help if I could.

Cheers, Stephyn

actsasgeek commented 4 years ago

I did some digging and it's definitely them not you.

Sphinx

Sphinx uses recommonmark, which supposedly implements the CommonMark standard. The CommonMark standard says that raw HTML tags in Markdown should just be passed through. Recommonmark seems to eat them. The only way to get the content of a div to show up at all is to place new lines before and after the content of the div...but this causes it to do weird things in the notebook.

Pandoc

1. markdown to pdf directly.

This ignores raw HTML formatting in general and the content of the div becomes escaped, pre-formatted text.

2. markdown to pdf using markdown_strict option

According to the documentation, the default for markdown is to ignore raw HTML tags in Markdown (contra the standard). However, you can specify -f markdown_strict to process raw HTML tags.

a. Without lines after the <div> and before the </div>, the output is removed. Raw formatting outside a div is ignored but the word does appear. Although this is surprising, it's not clear that "markdown_strict" applies to going straight from Markdown to LaTeX...it's only a claim about HTML.

b. With lines, the output is still snarfed... :/

3. markdown to html using markdown_strict option (no weird line breaks)

As you might imagine, this is perfect so I thought, what if we go to LaTeX through HTML instead of directly there?

4. markdown to html to pdf using markdown_strict option (no weird line breaks)

This works really well...about 80% of the way there. Raw HTML formatting outside divs is applied. The content of the div is formatted correctly. It even outputs the enclosed table (with somewhat odd formatting). However, the background color is removed.
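The two-step route from item 4 can be sketched as a small Python wrapper around the pandoc command line. This is only an illustration under assumptions: the file names are hypothetical, pandoc and a LaTeX engine must be installed for the PDF step, and the exact commands used in the experiments above are not given in this thread.

```python
import shutil
import subprocess


def pandoc_commands(markdown_path, html_path, pdf_path):
    """Return the two pandoc invocations for Markdown -> HTML -> PDF."""
    # Step 1: read the input as markdown_strict and write standalone HTML.
    to_html = ["pandoc", "-f", "markdown_strict", "-t", "html", "-s",
               "-o", html_path, markdown_path]
    # Step 2: read that HTML and let pandoc produce a PDF (via LaTeX by default).
    to_pdf = ["pandoc", "-f", "html", "-o", pdf_path, html_path]
    return [to_html, to_pdf]


def run_pipeline(markdown_path, html_path="out.html", pdf_path="out.pdf"):
    """Run both steps, stopping at the first failure."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc was not found on PATH")
    for command in pandoc_commands(markdown_path, html_path, pdf_path):
        subprocess.run(command, check=True)
```

Calling `run_pipeline("cell.md")` with a hypothetical `cell.md` would leave both `out.html` and `out.pdf` behind, which makes it easy to see at which step the styling is lost.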

If this pipeline were possible (I'm not sure), the question would be: is there a way to define styles for divs in LaTeX (or CSS?) and indicate where they should be applied, to get the remaining 20%?

mgeier commented 4 years ago

Have you been talking about recommonmark this whole time?

It has nothing to do with nbsphinx and I don't have anything to do with it.

If you need help with recommonmark you should probably ask at their issue tracker instead of here?

I was talking about using Markdown cells in Jupyter notebooks. There the boxes work as I've described (or rather linked to the docs) above.

Are you not using Jupyter notebooks?

If you want to use Markdown files but still get the boxes I'm talking about, you can use Jupytext with one of the supported Markdown based formats (see https://nbsphinx.readthedocs.io/en/0.8.0/custom-formats.html#Example:-Jupytext).

> Is there a fundamental issue with supporting embedded HTML?

I guess it's not fundamental, but implementing a full HTML parser is not a small endeavor.

> I would have thought that was "out of the box" delegated to whatever translator was used.

Well, if you use HTML as target format, the HTML snippets can easily be passed through, but what is LaTeX supposed to do with them?

For this to work with LaTeX output, the HTML actually has to be parsed. I'm doing this in nbsphinx for the mentioned special cases of <div> as well as for <img> tags (and probably some more?). However, this is just a very, very limited subset of HTML.
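To make the "very limited subset" concrete, here is a minimal sketch using Python's standard-library `html.parser`. It is not nbsphinx's actual implementation. The LaTeX environment names `infobox` and `warningbox` are hypothetical, and LaTeX escaping of the text is left out. Only the two alert divs are translated; every other tag and all styling is dropped while the text is kept.

```python
from html.parser import HTMLParser

# Hypothetical LaTeX environment names, chosen only for this illustration.
ALERT_ENVIRONMENTS = {"alert-info": "infobox", "alert-warning": "warningbox"}


class AlertDivTranslator(HTMLParser):
    """Translate alert <div>s to LaTeX environments; ignore all other tags."""

    def __init__(self):
        super().__init__()
        self.parts = []  # pieces of the LaTeX output
        self.stack = []  # one entry per open <div>: environment name or None

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return  # any other tag (b, span, table, ...) is silently dropped
        classes = (dict(attrs).get("class") or "").split()
        env = None
        if "alert" in classes:
            env = next((ALERT_ENVIRONMENTS[c] for c in classes
                        if c in ALERT_ENVIRONMENTS), None)
        self.stack.append(env)
        if env:
            self.parts.append("\\begin{%s}\n" % env)

    def handle_endtag(self, tag):
        if tag == "div" and self.stack:
            env = self.stack.pop()
            if env:
                self.parts.append("\n\\end{%s}" % env)

    def handle_data(self, data):
        self.parts.append(data)  # text is kept, without LaTeX escaping


def html_to_latex(html):
    translator = AlertDivTranslator()
    translator.feed(html)
    translator.close()
    return "".join(translator.parts)


print(html_to_latex('<div class="alert alert-info">Note <b>this</b></div>'))
print(html_to_latex('<div style="background: #eee">plain</div>'))
```

The second call shows the behavior described at the top of this thread: a div with an inline style keeps its text but loses its background, because nothing in the translator knows what to do with `style`.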

> The only way to get the content of a div to show up at all is to place new lines before and after the content of the div

Yes, that's what the CommonMark standard demands. You should do that. See also the third bullet point there: https://nbsphinx.readthedocs.io/en/0.8.0/markdown-cells.html#Info/Warning-Boxes
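As a sketch of that layout, the following builds a minimal notebook with one Markdown cell whose alert div has a blank line after the opening tag and before the closing tag. It uses only the standard library; the helper name `alert_cell` and the cell text are made up for illustration.

```python
import json


def alert_cell(text, kind="info"):
    """Return a Markdown cell containing an alert box with blank lines inside."""
    if kind not in ("info", "warning"):
        raise ValueError("only alert-info and alert-warning are mentioned here")
    source = (
        f'<div class="alert alert-{kind}">\n'
        "\n"          # blank line after the opening tag
        f"{text}\n"
        "\n"          # blank line before the closing tag
        "</div>"
    )
    return {"cell_type": "markdown", "metadata": {}, "source": source}


notebook = {
    "cells": [alert_cell("**Note:** this box should survive the PDF build.")],
    "metadata": {},
    "nbformat": 4,
    "nbformat_minor": 4,
}

# Print the notebook JSON; saving it as an .ipynb file is left to the reader.
print(json.dumps(notebook, indent=1))
```

With the blank lines in place, the text between the tags is parsed as Markdown, so `**Note:**` is rendered in bold rather than shown literally.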

> ...but this causes it to do weird things in the notebook.

In this case you should open an issue in the appropriate issue tracker.

See also https://github.com/jupyter/nbconvert/issues/1125 and https://github.com/jupyter/notebook/issues/1292#issuecomment-570548806.

It would be really great if we could get proper support for those <div> elements in the whole ecosystem!

actsasgeek commented 4 years ago
  1. I have been talking about nbsphinx the entire time.

As I think I've mentioned in the comments above, I don't know how nbsphinx works or what it uses directly or indirectly, but I'm willing to learn. Because I had seen issues posted in the past that referenced both Sphinx and Pandoc, I looked to see what the possible underlying tools might do in a similar situation. This is what I have documented above. I thought it might be a helpful place to start a discussion about workarounds (see below).

It's not just <div> elements.

The Markdown standard is that all raw HTML elements should be processed. This includes <b>Foo</b> which was not always properly processed in the test cases listed above.

  1. Based on my experiments, the only way to faithfully render Markdown appears to be to convert it to HTML using markdown_strict and then convert the HTML to LaTeX. But again, I don't know what the pipeline is that nbsphinx uses so I can't be sure.

  2. Yes, it would be very nice if everything worked the way it was supposed to. There are two lines of attack in that regard:

1) workarounds in nbsphinx
2) raise issues in the other components of the ecosystem.

I don't know enough about either. I'm just reporting what I see.

  1. Those extra lines inside divs cause the formatting in the Notebook to do the wrong thing. Whatever renderer Jupyter uses for its Markdown, it is doing the right thing (mostly). Here is what the example warning looks like in Jupyter Notebook with the extra lines:
[Screenshot: the example warning box in Jupyter Notebook with the extra lines]

which is not very aesthetically pleasing. If I remove the extra lines (so that the insides are not treated as Markdown), I get:

[Screenshot: the same warning box in Jupyter Notebook without the extra lines]

which is what I expect.

Ultimately, all I'm saying is that my expectation is that embedded HTML in a Notebook's Markdown Cell will be faithfully rendered in the PDF, within the limits of translation, and that I should not necessarily have to muck up one rendering to make another successful. It doesn't and I don't know why and I don't know if it can be fixed or if there's a workaround. And, again, this isn't just divs. Any raw HTML is ignored.

Thank you for your efforts.

mgeier commented 4 years ago

> I have been talking about nbsphinx the entire time.

OK, good to know.

So are the special alert divs working for you or not?

> The Markdown standard is that all raw HTML elements should be processed. This includes <b>Foo</b> which was not always properly processed in the test cases listed above.

Well there isn't really a Markdown standard (yet). And CommonMark, the closest thing to a standard, does explicitly say that it's not a full HTML parser.

It can just detect a few HTML-like structures and pass them along to whatever is supposed to display the result.

That works well if HTML is the end result, but it decidedly doesn't work for some other formats like LaTeX.

Any HTML support for LaTeX output would have to be individually implemented, which I've done for very few cases, as mentioned above.

If you have suggestions for further special cases that should be supported, please let me know!

> Based on my experiments, the only way to faithfully render markdown appears to be to convert it to HTML using markdown_strict and then the HTML to LaTeX.

There are certainly tools that can do that. And those certainly have their own limitations.

> But again, I don't know what the pipeline is that nbsphinx uses so I can't be sure.

nbsphinx is not doing that because that's just too far from what Sphinx normally does.

But you can of course take the HTML output of Sphinx (including the stuff from nbsphinx) and convert that yourself to LaTeX, if you prefer.

> Those extra lines inside divs cause the formatting in the Notebook to do the wrong thing.

Please report that to the appropriate issue tracker!

> Ultimately, all I'm saying is that my expectation is that embedded HTML in a Notebook's Markdown Cell will be faithfully rendered in the PDF, within the limits of translation,

As you say, there are limits in every toolchain. Certainly many things can be improved, but there are also fundamental limitations.

I've tried to work around the mentioned limitations by supporting the alert-styled <div>s I've mentioned above.

> and that I should not necessarily have to muck up one rendering to make another successful.

Yes, that should be the goal, but sometimes this is not plausibly achievable with a given set of tools.

> It doesn't and I don't know why and I don't know if it can be fixed or if there's a workaround. And, again, this isn't just divs. Any raw HTML is ignored.

Yes, again, that's known behavior.

If you want to have a PDF that's closer to the HTML appearance, you should use a tool that directly creates PDF from HTML pages, without the LaTeX middleman.

You could try https://github.com/betatim/notebook-as-pdf or you could try the pdfhtml builder of https://github.com/executablebooks/jupyter-book, see https://jupyterbook.org/advanced/pdf.html.

The result looks somewhat like a browser screenshot.

This may or may not be what you want.

If you find other alternatives, please let me know so I can add them to my collection of links: https://nbsphinx.readthedocs.io/en/0.8.0/links.html.

actsasgeek commented 4 years ago

Thanks for your help.

The alerts are not quite what I'm looking for but, knowing the limitations, I was able to make a table work well enough.

Cheers!