jgm / pandoc

Universal markup converter
https://pandoc.org

strange results for quote marks in pdf metadata #5812

Closed brainchild0 closed 1 year ago

brainchild0 commented 5 years ago

Consider the following command:

pandoc -o x.pdf <<< '---
title: |+
  "One" -- "Two" --- "Three"
---'

The result is a simple document with a title formatted with curved quote marks, and an en- and em-dash:

[image: md-title]

However, the effect in the PDF metadata is less pleasant:

$ pdfinfo x.pdf 
Title:          ``One'' – ``Two'' — ``Three''
Subject:        
Keywords:       
Author:         
Creator:        LaTeX via pandoc
Producer:       pdfTeX-1.40.18
CreationDate:   Fri Oct 11 09:38:00 2019 EDT
ModDate:        Fri Oct 11 09:38:00 2019 EDT
Tagged:         no
UserProperties: no
Suspects:       no
Form:           none
JavaScript:     no
Pages:          1
Encrypted:      no
Page size:      612 x 792 pts (letter)
Page rot:       0
File size:      39795 bytes
Optimized:      no
PDF version:    1.5

The dashes were translated nicely, but the quotation marks are handled strangely. What are the possibilities for creating a plain string that resembles the printed title as cleanly as possible?

agusmba commented 5 years ago

I wouldn't say "strangely", since that is the standard notation for left and right quotation marks in LaTeX. Whether the quotes should be removed in this particular case is a different matter.

jgm commented 5 years ago

We use \texorpdfstring to ensure that regular TeX commands don't go into the PDF bookmarks. It seems that the usual quotation ligatures also don't work in this context. You may find that it works properly if you use -t latex-smart --pdf-engine=xelatex. In that case pandoc won't use ligatures (because of -smart), and the Unicode quotes should be passed through unchanged. I don't know whether a change to the defaults is called for, because using Unicode quotes may not work without xelatex.

brainchild0 commented 5 years ago

Why would use of Unicode be dependent on a particular LaTeX engine? Are other engines unable to support characters outside of ASCII? Do non-English languages lack support in all engines but one?

Assuming the Unicode characters are not presenting a particular issue, would it not be more likely to produce desired results if normal translation by the smart extension, in contrast to the special LaTeX behavior, were applied to the metadata fields so as to generate the correct plain text string without LaTeX ligatures?

In other words, from the manual:

> In LaTeX, smart means to use the standard TeX ligatures for quotation marks

It simply seems that metadata might be a special case for this rule.

Would this create any problems other than the possibility that the engine cannot properly handle a Unicode string? And in any case, could basic ASCII quotation marks be used?
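The two alternatives raised in this comment (Unicode quotes, or a plain ASCII fallback) can be illustrated with a minimal sketch. It is not pandoc's implementation; the function name `plain_metadata` and the `ascii_only` flag are hypothetical and exist only for this example. It handles just the double-quote ligatures seen in the pdfinfo output above.

```python
# Illustrative sketch only; this is not how pandoc builds PDF metadata.
# Each pair maps a TeX double-quote ligature to a plain-text replacement.
UNICODE_QUOTES = [("``", "\u201c"), ("''", "\u201d")]  # curly double quotes
ASCII_QUOTES = [("``", '"'), ("''", '"')]              # straight-quote fallback


def plain_metadata(text: str, ascii_only: bool = False) -> str:
    """Replace TeX quote ligatures in a string destined for PDF metadata."""
    table = ASCII_QUOTES if ascii_only else UNICODE_QUOTES
    for ligature, replacement in table:
        text = text.replace(ligature, replacement)
    return text


# The Title value reported by pdfinfo in the original report.
# \u2013 and \u2014 are the en and em dash, which were already correct.
title = "``One'' \u2013 ``Two'' \u2014 ``Three''"
print(plain_metadata(title))                   # Unicode quotes
print(plain_metadata(title, ascii_only=True))  # ASCII fallback
```

The ASCII variant loses the left/right distinction but should be safe with any engine, which is the trade-off discussed in this thread.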

jgm commented 5 years ago

> Are other engines unable to support characters outside of ASCII?

Correct. pdflatex doesn't support non-ASCII well. xelatex and lualatex do.

Did you try the fix I suggested?

brainchild0 commented 5 years ago

Yes, with smart disabled, the document appearance seems the same, and the metadata looks correct. Both pdflatex and xelatex seem to work equally well.

But I am unsure of the penalties of disabling smart. It seems like the correct choice given that I write Markdown using these conventions.

But more to the point of the issue, would it not be an improvement if handling occurred correctly even with the extension enabled, even if in some cases it would mean using only basic ASCII quotation marks?

jgm commented 5 years ago

There are no penalties for disabling smart on LaTeX output if you're just producing PDF with xelatex or lualatex.

We can leave this open with the suggestion of using ASCII quotation marks, but I'm not sure it's worth the additional code complexity.

brainchild0 commented 5 years ago

Then maybe smart should be disabled for LaTeX, if it has no benefit and some liability.

By the way, is there an error case for using the Unicode string in pdflatex? It worked fine for me just now.

TomBener commented 1 year ago

Disabling the smart option for LaTeX may not be a good option. For straight quotes in headings, it would be great to wrap them with \texorpdfstring.

For example, converting:

\section{Pandoc's Features}\label{pandocs-features.md__pandocs-features}

to:

\section{\texorpdfstring{Pandoc's Features}{Pandoc’s Features}}\label{pandocs-features.md__pandocs-features}

#5909 is related to this issue.
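The wrapping proposed in this comment can be sketched as follows. This is only an illustration of the idea, not code from pandoc; `wrap_heading` is a hypothetical helper, and it assumes the heading text contains no other TeX macros or braces.

```python
def wrap_heading(tex_text: str) -> str:
    """Wrap heading text in \\texorpdfstring when it contains TeX quote input.

    The first argument keeps the TeX form for typesetting; the second
    argument is a Unicode form intended for the PDF bookmark.
    """
    pdf_text = (
        tex_text.replace("``", "\u201c")  # opening double quote
        .replace("''", "\u201d")          # closing double quote (before single)
        .replace("`", "\u2018")           # opening single quote
        .replace("'", "\u2019")           # closing single quote / apostrophe
    )
    if pdf_text == tex_text:
        return tex_text  # nothing to convert, no wrapper needed
    return "\\texorpdfstring{" + tex_text + "}{" + pdf_text + "}"


print("\\section{" + wrap_heading("Pandoc's Features") + "}")
```

Headings without quotes or apostrophes are returned unchanged, so the generated LaTeX only grows where a bookmark would otherwise differ from the printed heading.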

jgm commented 1 year ago

I think the original issue was solved long ago. Here's the result with current pandoc:

% pdfinfo x.pdf 
Title:           “One” – “Two” — “Three”

Thus, closing...

TomBener commented 1 year ago

@jgm Wait, quotes in headings are not processed correctly. Suppose the heading is written in Markdown as:

# "One"

After converting to PDF via LaTeX, the PDF bookmark is still ``One'' instead of the desired “One”.

jgm commented 1 year ago

@TomBener I'm not seeing this. You may be using an old version of pandoc? (Or older tex packages?)

TomBener commented 1 year ago

@jgm You're correct. But I found a weird result. Let me clarify.

The contents of the Markdown file test.md are as follows:

# "One" Heading

Some texts here.

# Pandoc's Features

Then if I run the command:

pandoc --pdf-engine=xelatex test.md -o test.pdf

The generated PDF test.pdf had the correct bookmark.

[screenshot: CleanShot 2023-08-22 at 10 38 43@2x]

However, if I split this into two steps, i.e. first generate LaTeX via pandoc:

pandoc -s test.md -o test.tex

Then compile test.tex to PDF manually:

xelatex test.tex

Then the bookmark in the generated PDF was not the desired one.

[screenshot: CleanShot 2023-08-22 at 10 38 20@2x]

The Pandoc version:

$ pandoc --version
pandoc 3.1.6.1
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /Users/username/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

For my workflow I need to generate LaTeX first and then compile it to PDF, so this difference matters. Could you help me with this issue? Thanks a lot.

jgm commented 1 year ago

I think this is because, when generating PDF via LaTeX, we disable the smart extension when writing the LaTeX. You could try -t latex-smart.

TomBener commented 1 year ago

Disabling the smart extension could be an option. However, it has side effects when writing Chinese. As the screenshot below shows, the English quotes were also treated as Chinese punctuation, which made them look quite wide.

[screenshot: CleanShot 2023-08-22 at 16 06 58@2x]

To generate the PDF above, the command below was executed:

pandoc --pdf-engine=xelatex -V CJKmainfont=NotoSerifCJKsc-Regular test.md -o test.pdf

Even when I loaded \usepackage[punct=plain]{ctex}, the issue remained.

The root of the problem is that Chinese and English share the same quotation mark code points in Unicode. On the Chinese LaTeX forum, it is recommended to write quotes as follows:

``English Quotes''

“中文引号”

This is indeed an annoying problem. I don't expect pandoc to change anything for it; I just wanted to raise the issue.
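The forum convention above could in principle be applied mechanically as a post-processing step. The following is a rough heuristic sketch, not something pandoc or ctex provides: `split_quote_styles` is a hypothetical name, the CJK test covers only the basic CJK Unified Ideographs block, and nested or unbalanced quotes are not handled.

```python
import re

# Assumption for this sketch: "Chinese" means the quoted text contains at
# least one character from the basic CJK Unified Ideographs block.
CJK = re.compile(r"[\u4e00-\u9fff]")

# A balanced pair of curly double quotes with no quotes nested inside.
QUOTED = re.compile("\u201c([^\u201c\u201d]*)\u201d")


def split_quote_styles(text: str) -> str:
    """Use TeX ligatures for non-CJK quotations, keep Unicode quotes for CJK."""

    def replace(match: re.Match) -> str:
        inner = match.group(1)
        if CJK.search(inner):
            return match.group(0)    # Chinese quotation: leave unchanged
        return "``" + inner + "''"   # otherwise: TeX quote ligatures

    return QUOTED.sub(replace, text)


print(split_quote_styles("\u201cEnglish Quotes\u201d \u201c中文引号\u201d"))
```

Because the decision is made per quotation, mixed-language documents keep wide quotes around Chinese text while Latin-script quotations get the narrower TeX forms.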