jgm / pandoc

Universal markup converter
https://pandoc.org

CJK font hijacks smart quotes when compiling PDF with XeLaTeX #7509

Open goderich opened 3 years ago

goderich commented 3 years ago

Explain the problem. When compiling a markdown document with mixed Latin and CJK text and using --pdf-engine=xelatex with smart quotes enabled, all single and double quotes become full-width CJK quotes. Basically the CJK font hijacks the smart quotes.

From my testing, this happens only with XeLaTeX, and only with smart quotes enabled. This did not happen last year (2020), when I was using pandoc extensively, but I discovered the bug a couple of months ago, and it still persists.

You don't even need to have actual CJK text in the file; simply declaring a CJK font produces the bug.

MWE:

---
CJKmainfont: Noto Sans CJK TC
---
It's a "bug"!

When compiling with pandoc --pdf-engine=xelatex -i bug.md -o bug.pdf, the single quote in it's and the double quotes around bug are noticeably wider (full-width CJK quotes, in my experience with them). Compiling with other PDF engines or with --from markdown-smart does not produce these wide quote marks.

Pandoc version? pandoc 2.14.0.2 on Archlinux

jgm commented 3 years ago

Unless you can see a problem with the LaTeX output pandoc is producing, this looks more like a problem with xelatex than with pandoc. Can you post on the tex stack exchange to see if the experts there have any ideas what is causing this, or how to work around it?

goderich commented 3 years ago

@jgm Thank you for the swift reply!

I just tried using pandoc to make a tex file and then manually running xelatex on it. This does not produce the bug. The bug only happens when I try to create a PDF from a markdown file (I haven't tried other input though) with pandoc using the xelatex backend.

I think with this new information it looks like the problem might be with pandoc after all? The whole smart quotes thing was weird, too. If it was xelatex, turning smart quotes off in pandoc shouldn't have had an effect.

jgm commented 3 years ago

In producing a PDF via pandoc, we disable the smart extension when creating the intermediate LaTeX file, to avoid bad ligatures like ?`. That's probably why you're seeing a difference. Try creating the LaTeX file using -t latex-smart.

goderich commented 3 years ago

Ah, yes, that does produce a tex file that gives full-width CJK quotes when compiled.

So this is fully on xelatex then?

goderich commented 3 years ago

Looks like it's indeed xelatex, specifically the xeCJK package. I got a MWE for it:

\documentclass{article}
\usepackage{xeCJK}

\begin{document}
It’s a “bug”!
\end{document}

Compiling this with xelatex has the same buggy output with full-width CJK quotes.

goderich commented 3 years ago

I found this on tex stack exchange: https://tex.stackexchange.com/questions/36878/xecjk-messes-with-punctuation

So according to the top answer, this is a feature. That's a problem for users like me, who need to use mixed English and CJK fonts. Also, that answer is from 10 years ago, but I didn't experience this problem last year. So this appears to be a relatively recent change in pandoc?

jgm commented 3 years ago

Yes, disabling smart when producing the intermediate tex file is a relatively recent change. You can of course work around this by generating the tex yourself and compiling. (Or try with lualatex?)

goderich commented 3 years ago

Is this a “wontfix” then?

jgm commented 3 years ago

The only possible fix I can think of is for the PDF module to check whether the metadata or variables include CJKmainfont and, if so, use smart in generating the intermediate tex. That re-introduces a risk of weird ligature collisions, but since these mostly come from settings for European languages, maybe it's okay.

goderich commented 3 years ago

Just tried lualatex and the bug is still there.

goderich commented 3 years ago

I'm not sure what ligatures you mean exactly. Is it stuff like {\"a}? Do people use those with xelatex? I thought the whole point of xelatex was to let people write unicode directly.

jgm commented 3 years ago

Ligatures like `` ... '' for double quotes, -- for en dash, etc. Yes, currently we do NOT use these with xelatex in producing a PDF -- that's why we disable smart for LaTeX. But the stackexchange link above says that the recommended way to work around this issue is to use `` ... '' for quotes instead of Unicode curly quotes, since the curly quotes are automatically interpreted as CJK punctuation.
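For readers who already have a tex file containing Unicode curly quotes, the substitution described above can be sketched as a small post-processing step. This is a naive illustration, not something pandoc does: the function name `curly_to_tex_quotes` is hypothetical, and the replacement is applied to the whole file, so it would also rewrite quotes inside verbatim or code environments.

```python
# Naive sketch: replace Unicode curly quotes with TeX quote ligatures.
# Assumption: the input is plain LaTeX source in which every curly quote
# is ordinary text (no verbatim blocks that must be left untouched).

CURLY_TO_TEX = {
    "\u201c": "``",  # left double quotation mark
    "\u201d": "''",  # right double quotation mark
    "\u2018": "`",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark / apostrophe
}


def curly_to_tex_quotes(tex_source: str) -> str:
    """Return tex_source with curly quotes rewritten as ASCII ligatures."""
    return tex_source.translate(str.maketrans(CURLY_TO_TEX))
```

Applied to the line from the LaTeX MWE, this turns the curly quotes around bug into `` and '' and the apostrophe into a plain '. Nested quotes that end up as three consecutive backticks or apostrophes may need manual spacing.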

goderich commented 3 years ago

I'm sorry, but I didn't quite understand whether you plan to address this.

What would be the issue with using `` ... '' instead of curly quotes? (Do they introduce unwanted ligatures?) Can it be triggered only when using xeCJK, or CJKmainfont as you suggested?

jgm commented 3 years ago

I note a possible change to pandoc above, with a possible disadvantage it would have. The reason we don't use the `` ligatures by default in generating PDFs is that the language support in babel/polyglossia tends to define language-specific ligatures (I can't remember them all, but stuff like ?`) that interact badly with these. (You can search this tracker for examples, e.g. https://github.com/jgm/pandoc/issues/4695.)

So, if we use smart in generating the PDF when CJKmainfont is used, there's potential for issues of this kind, if the western language used is one of the ones that use these ligatures.

Probably it's worth doing, which is why I haven't closed this.

jgm commented 3 years ago

But what you should do in the meantime is simply generate a standalone tex file (with -t latex+smart -s) and compile it yourself.
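The two-step workaround above could be scripted roughly as follows. This is a minimal sketch, assuming pandoc and xelatex are on the PATH and that a single xelatex pass is enough for the document; the function names `workaround_commands` and `build_pdf` are made up for illustration.

```python
import subprocess
from pathlib import Path


def workaround_commands(markdown_file: str) -> list[list[str]]:
    """Build the two commands: markdown -> standalone tex, then tex -> PDF."""
    src = Path(markdown_file)
    tex = src.with_suffix(".tex")
    return [
        # Step 1: standalone LaTeX with smart enabled, so quotes are written
        # as `` ... '' ligatures instead of Unicode curly quotes.
        ["pandoc", "-s", "-t", "latex+smart", "-o", tex.name, src.name],
        # Step 2: compile the generated file directly with xelatex.
        ["xelatex", tex.name],
    ]


def build_pdf(markdown_file: str) -> None:
    """Run both steps in the directory that contains the markdown file."""
    workdir = Path(markdown_file).resolve().parent
    for cmd in workaround_commands(markdown_file):
        subprocess.run(cmd, cwd=workdir, check=True)
```

Calling build_pdf("bug.md") would write bug.tex and bug.pdf next to the source file. Documents with cross-references or a table of contents may need xelatex to be run more than once.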

goderich commented 3 years ago

OK, I understand now, thank you.

jgm commented 3 years ago

I'm not able to reproduce this with my tex setup.

Oddly, I can reproduce the issue with your pure latex case. But when I use pandoc -o my.pdf --pdf-engine=xelatex and specify a CJKmainfont as in your example, the quotes look fine! I can't understand why. The intermediate tex file has curly unicode quotes, not ligatures.

jgm commented 3 years ago

Anyway, here's a patch that disables smart in producing LaTeX only if CJKmainfont isn't specified:

diff --git a/src/Text/Pandoc/PDF.hs b/src/Text/Pandoc/PDF.hs
index 9ff4bfb09..4c0514e34 100644
--- a/src/Text/Pandoc/PDF.hs
+++ b/src/Text/Pandoc/PDF.hs
@@ -24,7 +24,7 @@ import qualified Data.ByteString as BS
 import Data.ByteString.Lazy (ByteString)
 import qualified Data.ByteString.Lazy as BL
 import qualified Data.ByteString.Lazy.Char8 as BC
-import Data.Maybe (fromMaybe)
+import Data.Maybe (fromMaybe, isJust)
 import Data.Text (Text)
 import qualified Data.Text as T
 import qualified Data.Text.Lazy as TL
@@ -51,6 +51,7 @@ import Text.Pandoc.Shared (inDirectory, stringify, tshow)
 import qualified Text.Pandoc.UTF8 as UTF8
 import Text.Pandoc.Walk (walkM)
 import Text.Pandoc.Writers.Shared (getField, metaToContext)
+import Text.DocTemplates (lookupContext)
 import Control.Monad.Catch (MonadMask)
 #ifdef _WINDOWS
 import Data.List (intercalate)
@@ -97,10 +98,16 @@ makePDF program pdfargs writer opts doc =
 #else
         let tmpdir = tmpdir'
 #endif
-        doc' <- handleImages opts tmpdir doc
+        doc'@(Pandoc meta _) <- handleImages opts tmpdir doc
+        let cjk = -- see #7509, #7535
+                  isJust (lookupMeta "CJKmainfont" meta) ||
+                  isJust (lookupContext "CJKmainfont" (writerVariables opts)
+                            :: Maybe Text)
         source <- writer opts{ writerExtensions = -- disable use of quote
                                   -- ligatures to avoid bad ligatures like ?`
-                                  disableExtension Ext_smart
+                                  (if cjk
+                                      then id
+                                      else disableExtension Ext_smart)
                                    (writerExtensions opts) } doc'
         case baseProg of
           "context" -> context2pdf program pdfargs tmpdir source

Since I can't reproduce the issue yet, I'm a bit reluctant to apply this.

goderich commented 3 years ago

> I'm not able to reproduce this with my tex setup.
>
> Oddly, I can reproduce the issue with your pure latex case. But when I use pandoc -o my.pdf --pdf-engine=xelatex and specify a CJKmainfont as in your example, the quotes look fine! I can't understand why. The intermediate tex file has curly unicode quotes, not ligatures.

Huh. Well that's weird. Thank you for looking into this. I'm using a prepackaged binary on my distribution, and right now as a workaround I'm using a Makefile which generates latex+smart with pandoc, and then compiles the result using xelatex.

I'm afraid I don't have enough time at the moment to explore this further, but I should be able to help debug it in a couple of months. I'm running Archlinux on both my machines though, so I can't test with other systems.

khemarato commented 2 years ago

I'm also seeing this bug. Frustratingly, I can't seem to compile my document using the lualatex engine either, as it can't find my CJK font (it's removing the spaces from the font name and then saying it can't find that?) 😞

For those reading this later, you can turn off the smart extension using the --from=markdown-smart flag.

oldjove commented 2 years ago

I also experienced this bug. After some searching around, I found this solution, whereby you reassign the class of the offending characters. I've added this code to my template and find that it's solved the problem:

\AtBeginDocument{%
  \XeTeXcharclass`^^^^2026=0 % U+2026 horizontal ellipsis
  \XeTeXcharclass`^^^^2019=0 % U+2019 right single quotation mark
  \XeTeXcharclass`^^^^2013=0 % U+2013 en dash
  \XeTeXcharclass`“=0
  \XeTeXcharclass`”=0
  \XeTeXcharclass`‘=0
}