jupyter / nbconvert

Jupyter Notebook Conversion
https://nbconvert.readthedocs.io/
BSD 3-Clause "New" or "Revised" License

pdf output misses unicode and combined inside code cells #533

Open Ken-B opened 7 years ago

Ken-B commented 7 years ago

I develop in Julia which allows unicode and combined characters in code. However, they are missing when converting to pdf (html output works fine).

To replicate, create a new (Julia) notebook and inside a code cell type (using tab completion or just copy) for example

a = 1
â = 2
β = 3
β̂ = 4

(it seems even github has problems with the last one).

The pdf outputs: a=1 a=2 =3 =4

So the hat is removed on the second line and the characters disappear completely in the last two lines.
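
For reference, a small Python sketch (not part of the original report) that lists the code points behind the accented identifiers. It assumes that â is stored as the precomposed U+00E2 and that β̂ is β followed by the combining circumflex U+0302; the actual notebook may encode them differently.

```python
import unicodedata


def describe(text):
    """Return (code point, Unicode name, combining class) for each character."""
    return [
        (f"U+{ord(ch):04X}", unicodedata.name(ch), unicodedata.combining(ch))
        for ch in text
    ]


# Assumed encodings of the identifiers from the report.
a_hat = "\u00e2"            # precomposed: one code point
beta_hat = "\u03b2\u0302"   # beta + combining circumflex: two code points

for label, ident in [("a-hat", a_hat), ("beta-hat", beta_hat)]:
    print(label, describe(ident))

# NFC composes a + U+0302 into U+00E2, but Unicode has no precomposed
# beta-with-circumflex, so beta-hat stays two code points after NFC.
print(unicodedata.normalize("NFC", "a\u0302") == a_hat)
print(len(unicodedata.normalize("NFC", beta_hat)))
```

Under those assumptions, â needs one glyph from the font, while β̂ needs a β glyph plus correct placement of a combining mark, which is a harder requirement.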

As a workaround I first convert to html and then print to pdf, but then all the beautiful pdf layout is missing. I'm on nbconvert 5.1.1 (anaconda distribution) on mac osx 10.12.3. Let me know how I can help, thanks.

takluyver commented 7 years ago

I wonder if the font used in Latex for code doesn't have symbols for those characters.

mpacer commented 7 years ago

My guess is that it's a font issue, but there are a couple of other possibilities that I want to think through first:

How LaTeX handles combining characters like ̂ in math mode (at least traditionally) is using Unicode, and while this isn't a mathmode problem, I want to make sure that a couple of things work. Do you run into this issue when you use LaTeX in a markdown cell and directly invoke $\hat{\beta}$? Do your characters work when placed into a plain markdown cell? Do they work if you put the Unicode characters directly into a mathmode (e.g., $β̂ = 4$)? Do they work if you put them inside a literal Verbatim block?

Depending on how all that goes we will have an idea of how to move forward.

Also, the Latin characters that are displaying successfully are in the same Unicode plane so it shouldn't be that kind of Unicode problem. I think instead it might be that the monospace font we're using in Verbatim doesn't include these Unicode characters.
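
A quick way to check the plane claim, as a sketch: the helper name `plane` and the escaped code points below are illustrative assumptions about how the characters are encoded, not something taken from the notebook.

```python
def plane(ch):
    """Unicode plane of a single character (0 is the Basic Multilingual Plane)."""
    return ord(ch) >> 16


# Assumed code points for the characters discussed in this thread.
chars = {
    "a": "a",
    "a-hat": "\u00e2",
    "beta": "\u03b2",
    "combining circumflex": "\u0302",
}

for label, ch in chars.items():
    print(f"{label}: U+{ord(ch):04X} plane {plane(ch)}")
```

All four sit in plane 0, so a plane-related encoding problem would not explain why only some of them are dropped.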

Ken-B commented 7 years ago

when you use LaTeX in a markdown cell and directly invoke $\hat{\beta}$

works, outputs characters

into a plain markdown cell

doesn't work, same behaviour as in code cell, characters dropped

Unicode characters directly into a mathmode (e.g., $β̂ = 4$)

doesn't work, same behaviour as in code cell, characters dropped

inside a literal Verbatim block

doesn't work, same behaviour as in code cell, characters dropped

Are you able to reproduce yourself or do you think it's a local issue on my end? I'm more than happy to assist further.

mpacer commented 7 years ago

Ok, so this looks like it's not specific to the Julia notebook either. I'm able to reproduce in a Python notebook, just with markdown cells. I've spent a little time looking into it and I know that if I use a class that sets the roman font to one that I know has decent unicode support (Minion Pro) I get (mostly) expected behaviour back.

I will mention that even with Adobe's Minion Pro, it's not combining the β and the circumflex (U+0302) correctly, and that is a font with a lot of features, so we may not be able to accommodate this easily.

That said, we should definitely have support for non-combining unicode characters, so I'm going to spend time figuring this out.

mpacer commented 7 years ago

Ok, so I've partially traced this down to the use of the mathpazo package without explicitly invoking Palatino (or any particular font) as the mainfont (i.e., via \setmainfont{}, along with not using fontspec).

However, it still doesn't address the unicode inside math issue, which I know I fixed for my dissertation (and not in a completely hacky way). But I'll make a PR with the first bit of a fix.

Question: @takluyver I'm not able to find anywhere the list of fonts that are going to be installed with a default TeXLive install…I had looked this up for a few hours a couple weeks back too and I'm running into the same dead ends. Should we look into including a font as part of nbconvert itself?

mpacer commented 7 years ago

Some further updates: we need to include fontspec to get the combining character support (in general). Using the sourcecodepro package allows us to get these characters supported, but does not correctly position the circumflex above the β (much like in many places online, such as GitHub (β̂)).

However, I fear weirdness coming from using so many different font declaration mechanisms and would prefer to just use fontspec for everything if we're going to be using it at all.

I made a post on the TeX Stack Exchange, I'll keep you updated as I learn more.

takluyver commented 7 years ago

Question: @takluyver I'm not able to find anywhere the list of fonts that are going to be installed with a default TeXLive install…I had looked this up for a few hours a couple weeks back too and I'm running into the same dead ends. Should we look into including a font as part of nbconvert itself?

I may answer this question with several of my own, as I don't know much about Latex.

What kind of font files does Latex use? Are they the same across different platforms and different Tex distributions? How big would one be that includes all of the characters people are likely to want (which is probably all the characters)? What's required to tell it about a font outside of the normal fonts location? Is this the same across different platforms and Tex distributions?

If we really can't rely on a sufficiently complete default font being available to Latex, then I think it makes sense to look at including one. But I'm somewhat reluctant - if we start shipping anything that a Latex installation might be missing, then we risk becoming a Tex distribution ourselves.

mpacer commented 7 years ago

What kind of font files does Latex use?

By LaTeX, I'm guessing you mean one of the engines for outputting to PDF (and not just PostScript), so pdfLaTeX, XeLaTeX, LuaLaTeX (&c.). I am pretty sure all can handle Metafont (.mf), but the odds of someone using an .mf are low. Otherwise, I think that pdfLaTeX can handle TrueType and PostScript Type 1 fonts if you first generate TeX font metrics (.tfm) for them. But we've moved beyond that and now use XeLaTeX, which means we can use almost any font file we want, which especially includes OpenType font files (.otf). I don't think that anything can read a Web Open Font Format file (.woff), but that is more or less a wrapper around TrueType or OpenType font information (with additional metadata), so I'm guessing that it could work if the underlying font file was accessible, and that even if it doesn't right now, it would be possible without an insane amount of finagling.

Are they the same across different platforms and different Tex distributions?

Font formats? They should be more or less universal. There are exceptions having to do with bitmap fonts (which people shouldn't be using) or Apple Advanced Typography fonts (.aat), but those are more or less out of date (having been superseded by OpenType years ago).

How big would one be that includes all of the characters people are likely to want (which is probably all the characters)?

I don't think that there is a font in existence that contains "all the characters".

What's required to tell it about a font outside of the normal fonts location? Is this the same across different platforms and Tex distributions?

Because we're using XeLaTeX it should be able to find any fonts available from system locations (this can actually cause problems in some cases, see that tex.stackexchange post for an example with regards to Source Code Pro). Thus if we could install a font as part of nbconvert to the system location, we should be able to always find it.

If we really can't rely on a sufficiently complete default font being available to Latex, then I think it makes sense to look at including one. But I'm somewhat reluctant - if we start shipping anything that a Latex installation might be missing, then we risk becoming a Tex distribution ourselves.

I think we could safely distribute the font without needing to distribute TeX, especially since we're using XeLaTeX which means that we should be able to just install it in the system location and point out that people cannot use the font unless they are using XeLaTeX for export or have some other means of linking to the font.

takluyver commented 7 years ago

Thus if we could install a font as part of nbconvert to the system location, we should be able to always find it.

Could be tricky. If nbconvert is installed into an environment, it can't modify anything outside that environment. On Linux, we might be able to set $XDG_DATA_DIRS to affect where it looks for fonts (if xelatex uses that to find them). I don't know if there's an equivalent on other platforms. We could install it when nbconvert is run, but randomly installing stuff which affects the rest of the system is generally a bad idea.
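
However a font ends up installed, one way to see whether it is discoverable is to ask fontconfig. This is only a sketch: it assumes XeLaTeX resolves system fonts through fontconfig on the platform in question (not confirmed in this thread), that the `fc-list` command is on the PATH, and it uses "DejaVu Sans Mono" purely as an example family. The helper names are hypothetical.

```python
import shutil
import subprocess


def parse_families(fc_list_output):
    """Split the output of `fc-list : family` into a set of family names."""
    families = set()
    for line in fc_list_output.splitlines():
        # A single font can report several comma-separated family names.
        for name in line.split(","):
            name = name.strip()
            if name:
                families.add(name)
    return families


def font_available(family):
    """Return True or False, or None when fc-list is not installed."""
    if shutil.which("fc-list") is None:
        return None
    result = subprocess.run(
        ["fc-list", ":", "family"], capture_output=True, text=True
    )
    return family in parse_families(result.stdout)


print(font_available("DejaVu Sans Mono"))
```

A check like this only reads the font configuration; it does not install or modify anything outside the environment.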

msto commented 5 years ago

Hi,

I'm encountering this issue as well. (Jupyter v4.4.0, nbconvert v5.3.1). Is there a suggested fix or workaround?

Thanks!

kmundnic commented 4 years ago

One workaround is to use XeLaTeX and set the mono font to one that supports unicode, as suggested here: https://tex.stackexchange.com/questions/264390/xelatex-unicode-symbols-do-not-show-up-in-verbatim

ibehnam commented 2 years ago

This issue still exists! Any work on that by the devs?

parsiad commented 2 years ago

I can confirm that the workaround suggested by @kmundnic fixes the issue. Adding a bit more detail to their answer, you will have to do the following:

  1. Create a custom template that includes \setmonofont{DejaVu Sans Mono}. The easiest way to do this is to copy an existing template and edit it.
  2. Compile with jupyter nbconvert --template <YOUR_TEMPLATE> ...
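
If editing a template is inconvenient, a different route (not the template approach described above) is to export with jupyter nbconvert --to latex, patch the generated .tex file, and compile it with XeLaTeX yourself. The Python sketch below shows only the patching step. The function name `add_mono_font` is hypothetical, and it assumes the font is installed and that loading fontspec a second time without options is harmless in the generated preamble.

```python
# Lines to inject; the font name is the one suggested in this thread.
MONO_FONT_LINES = (
    "\\usepackage{fontspec}\n"
    "\\setmonofont{DejaVu Sans Mono}\n"
)


def add_mono_font(tex_source, font_lines=MONO_FONT_LINES):
    """Insert the font setup immediately before \\begin{document}."""
    marker = "\\begin{document}"
    head, sep, tail = tex_source.partition(marker)
    if not sep:
        raise ValueError("no \\begin{document} found in the LaTeX source")
    return head + font_lines + sep + tail


# Tiny stand-in for the exported file, just to show the effect.
sample = (
    "\\documentclass{article}\n"
    "\\begin{document}\n"
    "\\begin{verbatim}\n"
    "\u03b2 = 3\n"
    "\\end{verbatim}\n"
    "\\end{document}\n"
)
print(add_mono_font(sample))
```

The patched file has to be compiled with XeLaTeX (or LuaLaTeX), since fontspec does not work under pdfLaTeX.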