UTF-8 problem during converting to PDF

theasder commented 6 years ago

Hey there,

I have Jupyter Notebook from Anaconda on Ubuntu 16.04 with XeTeX installed. I tried to convert a notebook to PDF. All English words and formulas came through successfully, but the notebook also contained some non-ASCII UTF-8 characters, and those were ignored. Some logs here:

[W 18:37:28.242 NotebookApp] Notebook ЮраМолодец.ipynb is not trusted
[I 18:37:32.967 NotebookApp] Starting buffering for 005c099d-5a37-410e-812a-9b062b5744fc:e6a21be8811347f892517ca798014ee1
[I 18:37:33.290 NotebookApp] Kernel restarted: 005c099d-5a37-410e-812a-9b062b5744fc
[I 18:37:35.346 NotebookApp] Adapting to protocol v5.1 for kernel 005c099d-5a37-410e-812a-9b062b5744fc
[I 18:37:35.346 NotebookApp] Restoring connection for 005c099d-5a37-410e-812a-9b062b5744fc:e6a21be8811347f892517ca798014ee1
[I 18:37:35.347 NotebookApp] Replaying 6 buffered messages
[W 18:37:42.269 NotebookApp] Notebook ЮраМолодец.ipynb is not trusted
[I 18:37:43.054 NotebookApp] Support files will be in 
[I 18:37:43.054 NotebookApp] Making directory /root
[I 18:37:43.055 NotebookApp] Making directory /root
[I 18:37:43.055 NotebookApp] Making directory /root
[I 18:37:43.055 NotebookApp] Making directory /root
[I 18:37:43.056 NotebookApp] Writing 30590 bytes to /root/notebook.tex
[I 18:37:43.056 NotebookApp] Building PDF
[I 18:37:43.056 NotebookApp] Running xelatex 3 times: ['xelatex', '/root/notebook.tex']
[I 18:37:47.278 NotebookApp] Running bibtex 1 time: ['bibtex', '/root/notebook']
[W 18:37:47.302 NotebookApp] bibtex had problems, most likely because there were no citations
[I 18:37:47.303 NotebookApp] PDF successfully created

It generated a valid .tex file, but there is no multi-language support in it.

hycakir commented 6 years ago

See my answer here: https://stackoverflow.com/a/49582428/2372611

The problem is that Jupyter uses the xelatex command to compile the LaTeX (to support Unicode, I think). But xelatex is not needed for the generated file: it can be compiled directly with latex or pdflatex with Unicode support. I think the generated file lacks the configuration xelatex needs to render Unicode characters.

QGB commented 6 years ago

jupyter unicode convert pdf

t-makaro commented 6 years ago

Can someone please produce a minimal example notebook.ipynb and/or provide a copy of the LaTeX output from:

jupyter nbconvert --to latex notebook.ipynb

If I have a file to work with, then I can investigate this.

t-makaro commented 6 years ago

I believe this is relevant. The April 2018 release of LaTeX defaults to utf-8 encoding.

Also relevant.

If I can get a file and replicate the issue, I may be able to solve this.

frederik-elwert commented 5 years ago

I investigated the problem a bit, and the main issue does not seem to be that the UTF-8 is not correctly recognized. The actual problem is that the main font does not have the corresponding glyphs.

Jupyter uses the mathpazo package to load URW Palladio. But that font does not cover many scripts. Using DejaVu Sans instead, which covers a wide range of Unicode scripts, fixed the problem for me (it still does not cover cases like RTL languages, but that’s another problem).
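For anyone who wants to try the font swap in isolation, a minimal standalone sketch might look like the following. It is a test document, not nbconvert's actual template, and it assumes DejaVu Sans is installed where XeLaTeX can find it by name.

```latex
% Compile with xelatex.
% Assumption: "DejaVu Sans" is installed and visible to fontspec by name.
\documentclass{article}
\usepackage{fontspec}
\setmainfont{DejaVu Sans}  % replaces the default main font for the whole document
\begin{document}
Latin text, Кириллица, Ελληνικά.
\end{document}
```

If the Cyrillic and Greek words show up in the PDF, the missing glyphs were a font coverage issue and not an encoding issue.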

The problem is that DejaVu Sans is not exactly pretty, and this would affect all documents, even those that don’t use non-Latin scripts.

A possible solution seems to be the ucharclasses package, which allows defining separate fonts for different Unicode blocks. That way, the main (Latin) font could be left as it is, with fallback fonts specified only for other scripts.

The Noto fonts might be a viable set of fonts for non-Latin blocks.

t-makaro commented 5 years ago

I just spent some time exploring ucharclasses, and I believe that I can make this work.

If I added the following:

\usepackage[Latin,Greek]{ucharclasses}
\usepackage{fontspec}   

\newfontfamily{\mynormal}{Latin Modern Roman}
\setDefaultTransitions{\mynormal}{}
\newfontfamily{\mygreek}{Courier New} 
\setTransitionsForGreek{\mygreek}{}

to the bottom of the preamble (it messes with section titles if I put \usepackage{fontspec} any earlier), then symbols like θα work properly. I see no reason why this wouldn't work for other Unicode blocks. We just need to agree on fonts for the different blocks. I would also like to figure out how to store the default font instead of overriding it.

This will also only work in XeLaTeX, so it would be smart to wrap this in something like

\ifdefined\XeLaTeXonlycommand
...
\fi

This way it is still possible to compile the latex file using pdflatex.
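As a concrete version of that guard, one option is to test for \XeTeXrevision, a primitive that XeTeX defines and pdfTeX does not. This is only a sketch of the idea, not what nbconvert ships. The fonts are the ones from the snippet above and are assumed to be installed.

```latex
% Compiles with both xelatex and pdflatex.
% Assumption: Latin Modern Roman and Courier New are installed for the XeLaTeX run.
\documentclass{article}
\ifdefined\XeTeXrevision  % defined only when running under XeTeX
  \usepackage[Latin,Greek]{ucharclasses}
  \usepackage{fontspec}
  \newfontfamily{\mynormal}{Latin Modern Roman}
  \setDefaultTransitions{\mynormal}{}
  \newfontfamily{\mygreek}{Courier New}
  \setTransitionsForGreek{\mygreek}{}
\fi
\begin{document}
% The Greek sample is typeset only under XeLaTeX, so pdflatex never sees it.
Latin text\ifdefined\XeTeXrevision{} and Greek: θα\fi.
\end{document}
```

The iftex package and its \ifXeTeX test would be another way to write the same check.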

CC @mpacer

t-makaro commented 5 years ago

I just noticed an issue with this solution. \setDefaultTransitions{\mynormal}{} will switch to a non-monospaced font for any Latin characters, including those inside cell inputs/outputs. This could be changed to \setDefaultTransitions{\ifcell\somemonofont\else\mynormal\fi}{}, but then every single verbatim environment needs to be wrapped with \celltrue … \cellfalse, where \ifcell is defined by \newif\ifcell.
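To make the \ifcell idea concrete, here is a rough sketch of how the pieces might fit together in a standalone document. It has not been checked against nbconvert's template. Latin Modern Mono as \somemonofont is an assumed choice, and the other fonts are carried over from the earlier snippet.

```latex
% Compile with xelatex.
% Assumptions: all three fonts are installed; \somemonofont's font is a placeholder choice.
\documentclass{article}
\usepackage[Latin,Greek]{ucharclasses}
\usepackage{fontspec}

\newif\ifcell  % defines \ifcell, \celltrue and \cellfalse; the flag starts out false
\newfontfamily{\mynormal}{Latin Modern Roman}
\newfontfamily{\somemonofont}{Latin Modern Mono}
\newfontfamily{\mygreek}{Courier New}

% Choose the default font at transition time, depending on the flag.
\setDefaultTransitions{\ifcell\somemonofont\else\mynormal\fi}{}
\setTransitionsForGreek{\mygreek}{}

\begin{document}
Prose with Greek: θα

\celltrue   % entering a cell: default transitions should now pick the mono font
\begin{verbatim}
x = "θα"
\end{verbatim}
\cellfalse  % back to the normal text font
\end{document}
```

The flag has to be set outside the verbatim environment, since commands inside it are printed literally.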

jpgoldberg commented 2 years ago

When using xelatex (or lualatex), the preamble correctly loads the unicode-math package. But that is the only font setting it has. The OpenType fonts loaded by unicode-math are good for the math part, but are limited in other respects. In particular, Latin Modern Mono does not include Greek or Cyrillic.

If we replace \usepackage{unicode-math} with \usepackage[default]{fontsetup} we get all the goodness of unicode-math (because fontsetup loads unicode-math), but we get the full cm-unicode fonts for all of the text (including monospaced) which includes Greek and Cyrillic.
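A small test document for that swap could look like the following. It is a sketch that assumes the fontsetup package and the fonts its default option selects are installed; the sample text is made up for illustration.

```latex
% Compile with xelatex or lualatex.
% Assumption: fontsetup and the fonts used by its "default" option are installed.
\documentclass{article}
\usepackage[default]{fontsetup}  % used here in place of \usepackage{unicode-math}
\begin{document}
Latin, Ελληνικά, Кириллица, and math: $\alpha + \beta = \gamma$.

\texttt{monospaced: θα Юра}
\end{document}
```

The \texttt line is the interesting one, since monospaced Greek and Cyrillic are what goes missing in cell inputs and outputs.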

See my StackExchange answer for more detail.

jupyter / nbconvert, issue #786