Maybe use xelatex instead of pdflatex by default

JulienPalard commented 7 years ago

Subject: I built the CPython documentation in French and Japanese, and found it non-trivial to find the right set of options.

Problem

Given that:

Non-ASCII characters are used more and more, even in English (see the Є in https://docs.python.org/3.7/whatsnew/3.7.html#optimizations)
Documentation is sometimes written in other languages such as Japanese, French, and so on.
Documentation is sometimes translated from one language to another.

We can expect to find non-ASCII characters everywhere, and they are badly supported by pdflatex, even with utf8x, which comes with its own set of issues.

Proposed solution

I finally found that xelatex handles Unicode characters very well but does not work well with Japanese, while platex works well with Japanese.

platex is already the default for Japanese when latex_engine is not explicitly configured, which is nice, but there is no way to configure xelatex for all languages and platex for Japanese (https://github.com/sphinx-doc/sphinx/issues/4150).

This forces everyone to learn a lot about LaTeX and PDF generation, and finally forces them to use -D with external logic to switch between working engines, as in https://github.com/python/docsbuild-scripts/pull/34/files.

Also, the documentation is not very explicit about the usages of those engines (see https://github.com/sphinx-doc/sphinx/issues/4149).

What I propose is to switch the default from 'pdflatex' if language != 'ja' else 'platex' to 'xelatex' if language != 'ja' else 'platex', a combination that works without any other modification to build the CPython documentation in English, French, and Japanese.
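The proposed rule can be sketched as a small function. This is only an illustration of the proposal: `default_latex_engine` is a hypothetical name, not Sphinx's actual implementation.

```python
def default_latex_engine(language):
    """Return the proposed default LaTeX engine for a project language.

    Hypothetical helper illustrating the proposal; not Sphinx's own code.
    """
    # Japanese keeps pLaTeX; every other language (or none set) gets XeLaTeX.
    return 'platex' if language == 'ja' else 'xelatex'


for lang in ('en', 'fr', 'ja'):
    print(lang, default_latex_engine(lang))
```

An explicit latex_engine setting in conf.py would still take precedence over such a default.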

tk0miya commented 7 years ago

@jfbu could you give us comments for this please?

I can't tell whether this would be better. I don't know which of LuaTeX and XeTeX is the better successor to pdfLaTeX, nor whether they are stable enough for use with Sphinx.

Of course, I will agree to change the default LaTeX engine if either one is stable enough.

jfbu commented 7 years ago

I will make general remarks.

LuaLaTeX is actively maintained and will probably offer more and more features via dedicated packages which achieve things currently impossible in TeX. But these advantages are probably not needed by the vast majority of Sphinx projects. Besides, it appears that LuaLaTeX opens up new security concerns in the TeX world, because the Lua scripting language does not have the restrictions on file opening and writing which exist with pdfTeX binaries as distributed with the major installations (TeXLive and MikTeX). It may be said that such concerns already exist from running Python scripts... nevertheless this makes one think twice before adopting it on a grand scale by default. Thus, I would not recommend using lualatex by default before experience has accumulated elsewhere.

XeLaTeX does not have this issue, but switching to it does not solve all Unicode-related problems:

in hand-written documents, authors manually switch languages according to the needed glyphs, and they set up appropriate fonts for each language (at least indirectly via the polyglossia package),
so far the Sphinx LaTeX writer does not support multi-lingual documents. Even if it did, the author of a Sphinx project would need to manually add mark-up to the source in case of exotic Unicode characters to signal the change of language, hence the possible OpenType font to use: fonts do not support all scripts, although some fonts do support a wide range of scripts,
it appears that polyglossia support for French lags behind babel+french features, so if we at Sphinx make xelatex+polyglossia the default, we may raise specific French issues -- admittedly they may be relevant only to expert LaTeX users, who will know how to switch back to babel+french,
there are issues with xelatex regarding math mode: it has some currently unfixed bugs there, but this is arguably not a very strong deterrent for Sphinx projects.

Making xelatex default will modify looks of all Sphinx produced PDFs, because xelatex should be used with OpenType fonts. It can be used with traditional TeX fonts, but then hyphenation mechanism of TeX is broken in some languages. Recently the LaTeX team has modified behaviour of LaTeX so that by default if used with xelatex engine it will use OpenType version of lmodern font.

So making xelatex the default also requires reviewing the font configuration for all Sphinx-supported languages and, as I said, it will change the default look of all Sphinx-built PDF documentation.

This looks like quite some work at Sphinx side... I think first step is to move Sphinx towards supporting multi-lingual documents. Because making xelatex default engine is not by itself a 100% solution to all problems related to Unicode input. It requires extra steps.

jfbu commented 7 years ago

One last pro and con:

typically xelatex produced PDFs are smaller than pdflatex produced ones, when using traditional TeX fonts, because xelatex better compresses the font; but as explained already, xelatex should not be used with traditional TeX fonts for optimal results,
compilation times with xelatex or lualatex are often significantly increased compared to pdflatex builds.

JulienPalard commented 7 years ago

That's a lot to consider and I'm no LaTeX expert. I just noticed that the current default (pdflatex/platex) put me in a hard situation when building English, French, and Japanese:

With the default configuration, only Japanese succeeds (platex)
Adding utf8x fixes the English build (I don't remember whether it breaks platex-Japanese)
Building with xelatex (and removing utf8x) to try to fix French breaks Japanese (no more platex by default for Japanese)

So I just can't get a successful build with conf.py alone; I have to use sphinx-build -D flags to pass the right latex_engine for each language, with external logic.

It took me some time to find the "right combination", which looks in fact really simple, just replace pdflatex with xelatex as a default engine but keep the "default to platex for japanese if default engine is used".
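The external logic described above can be sketched as follows. This is a minimal illustration, not the actual docsbuild-scripts code: the function name and the `Doc` and `build` paths are assumptions, while `-b` and `-D` are regular sphinx-build options.

```python
import shlex


def sphinx_latex_command(language, source_dir='Doc', build_root='build'):
    """Build a sphinx-build command line with a per-language LaTeX engine.

    Hypothetical helper; the directory names are placeholders.
    """
    # Same rule as in the proposal: pLaTeX for Japanese, XeLaTeX otherwise.
    engine = 'platex' if language == 'ja' else 'xelatex'
    return [
        'sphinx-build', '-b', 'latex',
        '-D', f'language={language}',
        '-D', f'latex_engine={engine}',
        source_dir, f'{build_root}/latex-{language}',
    ]


for lang in ('en', 'fr', 'ja'):
    # Print the command instead of running it, so the sketch has no side effects.
    print(shlex.join(sphinx_latex_command(lang)))
```

The returned list can be handed to subprocess.run once the paths match the actual project layout.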

On the one hand I may be short-sighted, as I tested a single project; on the other hand the Python documentation is huge (230k lines of rst).

fyears commented 7 years ago

+1 to xelatex

I can confidently say that most Chinese LaTeX users prefer xelatex to pdflatex nowadays, because xelatex has MUCH better support for OpenType fonts, so Chinese users find it WAY easier to display Chinese characters in the generated PDF. The same technology applies to Japanese and Korean characters too (we often refer to their fonts together as CJK fonts).

@jfbu In my understanding, sphinx-doc maintains its default PDF template, so something like a “font issue” should not be a problem (to users)?

@JulienPalard switching from pdflatex to xelatex for JP docs is not THAT trivial. At the least you should set \setCJKmainfont, otherwise JP characters are not expected to be displayed correctly. Still, it’s fairly easy for simple cases, see https://tex.stackexchange.com/questions/139081/cjk-blank-output-for-japanese-characters

fyears commented 7 years ago

Some more helpful info here:

xelatex is stable enough to use. luatex is not as popular as xelatex among Chinese users.
Refer to https://www.sharelatex.com/learn/Japanese (also, check the pages for Chinese and Korean). The xetex packages are universal for CJK environments, if we only need to display some characters and do not consider complicated locales (e.g. how dates are rendered). One true issue to be considered is how to determine the \setCJKmainfont for different systems (win/linux/osx ship different fonts!) and different languages (sorry, but people in CJK develop different fonts).
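As a rough illustration of the per-system font problem, a conf.py could pick the \setCJKmainfont argument from the platform. This is a sketch under assumptions: the font names are guesses at what is commonly installed on each system (they must exist on the build machine), and `cjk_preamble` is a hypothetical helper. \setCJKmainfont itself comes from the xeCJK package.

```python
import platform

# Assumed Japanese fonts per platform; adjust to what is actually installed.
CJK_MAIN_FONTS = {
    'Darwin': 'Hiragino Mincho ProN',
    'Windows': 'MS Mincho',
    'Linux': 'IPAexMincho',
}


def cjk_preamble(system=None):
    """Return LaTeX preamble lines selecting a CJK main font (hypothetical helper)."""
    # Fall back to the Linux choice when the platform is not in the table.
    font = CJK_MAIN_FONTS.get(system or platform.system(), 'IPAexMincho')
    return '\\usepackage{xeCJK}\n\\setCJKmainfont{%s}\n' % font


# conf.py settings using the helper above.
latex_engine = 'xelatex'
latex_elements = {'preamble': cjk_preamble()}
```

A table like this only covers one language; Chinese and Korean projects would each need their own font choices.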

jfbu commented 7 years ago

There is no notion of seamless experience in LaTeX regarding Unicode, although xelatex and lualatex have considerably improved the situation.

Already, Sphinx does the minimal right thing regarding xelatex which is not to use inputenc nor fontenc. With a recent LaTeX this means it will automatically use the Latin Modern OpenType font which has good coverage of European (in the large sense) languages.

$ otfinfo -s lmroman10-regular.otf
DFLT        Default
cyrl        Cyrillic
latn        Latin
latn.AZE    Latin/Azeri
latn.CRT    Latin/Crimean Tatar
latn.MOL    Latin/Moldavian
latn.NLD    Latin/Dutch
latn.PLK    Latin/Polish
latn.ROM    Latin/Romanian
latn.TRK    Latin/Turkish

It has no coverage for Chinese or Hebrew, for example. This means a Sphinx user with a project in these languages must customize the LaTeX preamble to use \setmainfont (or \setCJKmainfont as documented by @fyears) to pick a suitable font. (Sphinx loads fontspec, which provides this macro; xelatex also has its own font loading primitives which advanced xelatex users use directly. Normal users will use fontspec and will have had to read part of its documentation; does this include the average Sphinx-doc user?)

The way this is done is system dependent regarding fonts which are provided with TeX itself (and on Mac OS X one must use different methods depending on whether the OpenType font is a system/user font or in the TeX tree).

Even the minimal Sphinx set-up for xelatex contains elements which are not satisfactory: the coverage of French language by polyglossia is far more restricted than what the babel-frenchb module provides: with polyglossia there is no conformity regarding footnotes and lists with the French typographical rules.

Besides, latex-babel is now (after some years of stagnation) actively maintained and being developed in the direction of xelatex/lualatex support. As a result it is not clear whether polyglossia will remain preferable to babel in the future.

Regarding French, as I said, it is not. A French Sphinx user of xelatex is currently well advised to set the 'babel' key of latex_elements to '\usepackage{babel}'. Sphinx internally uses 'polyglossia' but will obey the 'babel' key if the user has set it:

        # set up multilingual module...
        # 'babel' key is public and user setting must be obeyed
        if self.elements['babel']:
            # this branch is not taken for xelatex/lualatex if default settings
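In conf.py, the override described above would look roughly like this (a fragment only; the rest of the project configuration is assumed to be unchanged):

```python
# conf.py (fragment)
language = 'fr'
latex_engine = 'xelatex'

latex_elements = {
    # Non-empty 'babel' key: Sphinx obeys it instead of loading polyglossia.
    'babel': r'\usepackage{babel}',
}
```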

Making xelatex default makes no sense if reasonable font defaults for all Sphinx covered languages are not provided.

For example, just as we have specific coverage of Japanese [1]_, we can provide specific coverage of Chinese if consensus emerges on how best to set it up with XeLaTeX, and this must be done for Windows, Mac OS X, Unixen... Contributions are most welcome!

.. [1] which as mentioned already in this thread goes currently via platex engine which does not support Unicode.

And, stressing again, this does not solve problems one may encounter with stray Unicode characters !

jfbu commented 7 years ago

Here is basic test of Hebrew with xelatex:

\documentclass[hebrew]{article}
\usepackage{polyglossia}
\setmainlanguage{hebrew}
\begin{document}
מבוא
\end{document}

Produces errors:

./testhebrew.tex:4: Package polyglossia Error: The current roman font does not 
contain the Hebrew script!
(polyglossia)                Please define \hebrewfont with \newfontfamily.

See the polyglossia package documentation for explanation.
Type  H <return>  for immediate help.
 ...                                              

l.4 \begin{document}

(That was another \errmessage.)

Missing character: There is no מ in font [lmroman10-regular]:mapping=tex-text;!
Missing character: There is no ב in font [lmroman10-regular]:mapping=tex-text;!
Missing character: There is no ו in font [lmroman10-regular]:mapping=tex-text;!
Missing character: There is no א in font [lmroman10-regular]:mapping=tex-text;!

./testhebrew.tex:6: Package polyglossia Error: The current roman font does not 
contain the Hebrew script!
(polyglossia)                Please define \hebrewfont with \newfontfamily.

See the polyglossia package documentation for explanation.

Trying Sphinx on a minimal Hebrew document with xelatex leads to plenty of problems:

.. FOO documentation master file, created by
   sphinx-quickstart on Sat Oct 21 14:57:01 2017.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

תוכן הענייני
============

רשימת הטבלאות

in conf.py:

language = 'he'
latex_engine = 'xelatex'

Package bidi Error: Oops! you have loaded package xcolor after bidi package. Please load package xcolor before bidi package, and then try to run xelatex on your document again.

Package bidi Error: Oops! you have loaded package float after bidi package. Please load package float before bidi package, and then try to run xelatex on your document again.

Package bidi Error: Oops! you have loaded package framed after bidi package. Please load package framed before bidi package, and then try to run xelatex on your document again.

Package bidi Error: Oops! you have loaded package wrapfig after bidi package. Please load package wrapfig before bidi package, and then try to run xelatex on your document again.

etc... etc...

and the one of interest to this thread:

Package polyglossia Error: The current roman font does not contain the Hebrew script!
...

(as above)

This confirms that a Sphinx-doc user will have to know a minimum of LaTeX macros (\newfontfamily) and documentation (fontspec, polyglossia) before reaching a usable state for Hebrew-language documents, even with xelatex as latex_engine.

(we at Sphinx should probably take care of loading polyglossia hence bidi at the right place)
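For the font error specifically, the fix that polyglossia asks for could be passed through the preamble, along these lines. This is an untested sketch: it assumes a Hebrew-capable OpenType font named FreeSerif (from GNU FreeFont) is installed, and it does nothing about the bidi package ordering errors shown above.

```python
# conf.py (fragment)
language = 'he'
latex_engine = 'xelatex'

latex_elements = {
    # Define \hebrewfont as requested by the polyglossia error message.
    # 'FreeSerif' is an assumed font name; any installed font with Hebrew glyphs works.
    'preamble': r'\newfontfamily\hebrewfont[Script=Hebrew]{FreeSerif}',
}
```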

tk0miya commented 7 years ago

@jfbu, thank you for the comment.

As you said, moving to xelatex is not a silver bullet. AFAIK, there are no common settings that work well for all languages.

@fyears For Chinese docs, #3272 is proposed. It tries to move to xelatex and ctex only if language is zh_*.

tk0miya commented 7 years ago

Note:

This looks like quite some work at Sphinx side... I think first step is to move Sphinx towards supporting multi-lingual documents.

I don't know whether this is really needed. I've never seen such a request. So it's okay to support only one language per project at a time.

(edit) oh, I understand #4159 requires it...

jfbu commented 7 years ago

@tk0miya in the case of CPython docs (which is big...), for example French translation is only at 27.2% currently.

It could make sense (perhaps not only for PDF, but for PDF it is important because hyphenation depends on language) to have multi-lingual support. Currently only portions of CPython's library.pdf (about 1800 pages) are in French, but the whole is treated as a French document. This means that hyphenation is wrong for all English text, which is the vast majority of the document.

(I am using make latex SPHINXOPTS="-D locale_dirs=locales -D language='fr' -D gettext_compact=0" to build the CPython French documentation, with Doc/locales/fr/LC_MESSAGES a symlink to the python-docs-fr cloned repo at 3.6 branch)

tk0miya commented 7 years ago

Ah, I understand. Indeed, it is a mixture of English and French. I feel it is very difficult to support this in Sphinx. We would have to mark the language per sentence or per word...

jfbu commented 7 years ago

@tk0miya But this is done by Docutils already. Consider this test file

Welcome to FOO's documentation!
===============================

Hello

.. class:: language-fr

   Bonjour

.. class:: language-de

   Guten Tag

Again English.

and then rst2latex.py index.rst test.tex constructs a LaTeX file which looks like this (irrelevant lines cut):

\documentclass[a4paper]{article}
% generated by Docutils <http://docutils.sourceforge.net/>
\usepackage{cmap} % fix search and cut-and-paste in Acrobat
\usepackage{ifthen}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[french,ngerman,english]{babel}
% Prevent side-effects if French hyphenation patterns are not loaded:
\frenchbsetup{StandardLayout}
\AtBeginDocument{\selectlanguage{english}\noextrasfrench}

[lines cut]

\begin{document}
\maketitle

Hello

\foreignlanguage{french}{Bonjour}

\foreignlanguage{ngerman}{Guten Tag}

Again English.

\end{document}

On further experiment, in the case of multiple paragraphs each one is passed separately as an argument to \foreignlanguage. It would probably be better to use \begin{otherlanguage}{french}...\end{otherlanguage} mark-up.
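The environment-based mark-up suggested above could be emitted by something like the following. This is a hypothetical helper for illustration, not Docutils or Sphinx writer code; it simply wraps a run of paragraphs in one babel otherlanguage environment.

```python
def wrap_in_language(paragraphs, babel_language):
    """Wrap consecutive paragraphs in a single babel 'otherlanguage' environment.

    Hypothetical helper; a real LaTeX writer would work on the doctree instead.
    """
    # Paragraphs are separated by a blank line, as in ordinary LaTeX source.
    body = '\n\n'.join(paragraphs)
    return ('\\begin{otherlanguage}{%s}\n%s\n\\end{otherlanguage}'
            % (babel_language, body))


print(wrap_in_language(['Bonjour', 'Un autre paragraphe'], 'french'))
```

Compared with one \foreignlanguage per paragraph, a single environment keeps the language switch active across paragraph breaks.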

On the other hand Sphinx make latex produces this kind of output:

Hello

\begin{fulllineitems}
\pysigline{\sphinxbfcode{language-fr}}
Bonjour

Un autre paragraphe

\end{fulllineitems}

\begin{fulllineitems}
\pysigline{\sphinxbfcode{language-de}}
Guten Tag

\end{fulllineitems}

Again English.

Possibly related to #4010

jfbu commented 7 years ago

HTML output from rst2html.py looks like this:

<p>Hello</p>
<p lang="fr">Bonjour</p>
<p lang="fr">Un autre paragraphe</p>
<p lang="de">Guten Tag</p>
<p>Again English.</p>

mitya57 commented 7 years ago

@jfbu: In Sphinx, the .. class:: directive has a different meaning. Use .. rst-class:: language-XY if you want to insert the original Docutils directive. It should work then.

jfbu commented 7 years ago

@mitya57: thanks for the tip, which does work indeed for html target, producing same lang attributes as rst2html.py. But it fails for latex target (as expected from actual writers/latex.py code...); the fulllineitems environments are gone however, the output simply losing all traces of the language tags in reST sources.

jfbu commented 5 years ago

Sphinx 2.0 will use GNU FreeFont with xelatex, providing good coverage of Latin, Cyrillic and Greek scripts (as well as Arabic and Hebrew). This adds a new requirement: fonts-freefont-otf on Ubuntu xenial or, e.g., texlive-gnu-freefont in Fedora 29. Perhaps Sphinx 3.0 can then have 'xelatex' as the default latex_engine for non-Japanese projects.

(edit: and make suitable choice of fonts for Chinese with 'xelatex')

goyalyashpal commented 8 months ago

french and japanese - @ JulienPalard at https://github.com/sphinx-doc/sphinx/issues/4159#issue-266095659

i use hindi, and the same problem will be faced with any indian language script (Hindi, Nepali, Tamil, Telugu, Punjabi, Marathi, Gujarati, ...).

compilation times with xelatex or lualatex are often significantly increased compared to pdflatex builds. - @ jfbu at https://github.com/sphinx-doc/sphinx/issues/4159#issuecomment-337268445

that's because xelatex outputs pdf, and producing the pdf is what takes time. to save on that, latexmk uses xelatex to quickly generate the output of intermediate passes as .xdv files, then converts that to .pdf via xdvipdfmx only once at the end.

Ref (abridged by me, original at: latexmk-pdf):

  -pdfxe Generate pdf version of document using xelatex [and xdvipdfmx via
         .xdv intermediate files].  Note that production of a .xdv file by
         xelatex is fast, [but of] a .pdf file can be quite time consuming
         when  document includes  large graphics files. So [this approach]
         can result in substantial gains in procesing time, since the .pdf
         file is produced once rather than on every run of xelatex.

Unabridged verbatim:

> -pdfxe Generate pdf version of document using xelatex. Note that to
> optimize processing time, latexmk uses xelatex to generate an
> .xdv file rather than a pdf file directly. Only after possibly
> multiple runs to generate a fully up-to-date .xdv file does
> latexmk then call xdvipdfmx to generate the final .pdf file.
>
> (Note: The reason why latexmk arranges for xelatex to make an
> .xdv file instead of the xelatex's default of a .pdf file is as
> follows: When the document includes large graphics files,
> especially .png files, the production of a .pdf file can be quite
> time consuming, even when the creation of the .xdv file by
> xelatex is fast. So the use of the intermediate .xdv file can
> result in substantial gains in procesing time, since the .pdf file
> is produced once rather than on every run of xelatex.)

jfbu commented 3 months ago

@goyalyashpal

There is in our docs this tip:

Also, if latexmk is at version 4.52b or higher (January 2017) LATEXMKOPTS="-xelatex" speeds up PDF builds via XeLaTeX in case of numerous graphics inclusions.

This -xelatex option is (with current Latexmk) equivalent to -pdfxe -dvi- -ps-.
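For a scripted build, the tip above amounts to passing LATEXMKOPTS on the make command line. A minimal sketch, assuming the project uses the usual Sphinx Makefile with a latexpdf target; `latexpdf_command` is a hypothetical helper and the command is only printed, not run:

```python
import shlex


def latexpdf_command(use_xdv=True):
    """Return a 'make latexpdf' command, optionally with LATEXMKOPTS=-xelatex.

    Hypothetical helper; assumes a Sphinx-generated Makefile in the current directory.
    """
    command = ['make', 'latexpdf']
    if use_xdv:
        # Lets latexmk go through .xdv files and call xdvipdfmx only once.
        command.append('LATEXMKOPTS=-xelatex')
    return command


print(shlex.join(latexpdf_command()))
```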

It is probably time in 2017 we do this unconditionally.
