jgm / pandoc

Universal markup converter
https://pandoc.org

LaTeX writer incorrectly handles punctuation in underlined text #9006

Open phlummox opened 1 year ago

phlummox commented 1 year ago

Summary

Attempting to convert the following Markdown to PDF (via LaTeX) results in an error:

[john's shoes]{.underline}

Likewise for documents containing any of the following Markdown:

Steps to reproduce

  1. Create test.md:

    [john's shoes]{.underline}
  2. Convert it to PDF, via LaTeX:

    pandoc -t latex -o test.pdf test.md

Expected behaviour

A PDF document should be generated.

Actual behaviour

The following output is produced:

Error producing PDF.
! Argument of \UTFviii@three@octets@combine has an extra }.
<inserted text> 
                \par 
l.64 \ul{john’s shoes}

Result of --verbose

If the Pandoc commands above are run with --verbose, it can be seen that Pandoc is generating quite different LaTeX to what it would produce if not asked to create a PDF.

Given the Markdown [john's shoes]{.underline}, the command

$ pandoc -s -t latex -o test.tex test.md

will produce a correct .tex document, containing the code

\ul{john's shoes}

(And similarly for all the other examples reported above.) If asked to generate a PDF, however, then looking at the temporary .tex file shows that it contains the line quoted in the error message above, with a typographic apostrophe:

\ul{john’s shoes}

Given that this just isn't the correct way of writing LaTeX, it's not surprising that problems ensue.

It also makes the reason for the bug harder to spot, since the incorrect LaTeX code which Pandoc is actually using differs from the correct LaTeX code it generates when asked to output a .tex file.
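As a rough sketch of how to spot this kind of difference, the helper below lists the lines of a file that contain bytes outside plain ASCII, such as the typographic apostrophe in the failing \ul{...} line. The function name `find_non_ascii` is hypothetical, and the `-a` option assumes GNU or BSD grep.

```shell
# Hypothetical helper (not part of Pandoc): print, with line numbers, every
# line of the given file that contains a byte outside printable ASCII
# and whitespace.
find_non_ascii() {
    # LC_ALL=C makes grep work byte by byte, so each byte of a multi-byte
    # UTF-8 character counts as non-printable; -a forces text mode.
    LC_ALL=C grep -an '[^[:print:][:space:]]' "$1"
}
```

It could be run on a log captured with `pandoc --verbose -t latex -o test.pdf test.md 2> verbose.log`, assuming the verbose output (including the intermediate LaTeX) is written to standard error.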

Possible corrections to manual

Currently, the Pandoc manual, here --

https://github.com/jgm/pandoc/blob/6067e477acd933316ba23a1838aafffad872f627/MANUAL.txt#L132

states

To debug the PDF creation, it can be useful to look at the intermediate representation: instead of -o test.pdf, use for example -s -o test.tex to output the generated LaTeX. You can then test it with pdflatex test.tex.

But if Pandoc's behaviour when creating PDFs – creating a temporary LaTeX file with "smart" quotes, Unicode en-dashes, etc. – is intentional, then this part of the manual is not correct. It isn't actually useful to look at Pandoc's LaTeX output, because that's not what Pandoc will use internally, and it won't necessarily help you debug pdflatex compilation problems; you should just run with --verbose instead. So this part of the manual might need to be amended.

Behaviour of version of Pandoc from last year (2022)

In case it's helpful – I happened to have a copy of Pandoc 2.19.2 on my system, downloaded from https://github.com/jgm/pandoc/releases/download/2.19.2/pandoc-2.19.2-linux-amd64.tar.gz last year. That version doesn't exhibit the bug: it correctly generates a PDF.

So this seems to be a regression in behaviour.

Pandoc version

Latest release (3.1.6.1), installed from .deb file downloaded from the "Releases" page.

$ pandoc --version
pandoc 3.1.6.1
Features: +server +lua
Scripting engine: Lua 5.4
User data directory: /home/phlummox/.local/share/pandoc
Copyright (C) 2006-2023 John MacFarlane. Web: https://pandoc.org
This is free software; see the source for copying conditions. There is no
warranty, not even for merchantability or fitness for a particular purpose.

Operating system is Ubuntu 20.04.6 (Focal Fossa).

LaTeX version is TeX Live 2019.20200218-1:

$ dpkg --status texlive | grep Version
Version: 2019.20200218-1

jgm commented 1 year ago

I suspect this has to do with

    + Use `soul` instead of `ulem` for strikeout, underline (#8411).
      This handles things like hyphenation, line breaks, and nonbreaking
      spaces better.

Cf #8411 and https://tex.stackexchange.com/questions/160220/french-accents-in-hl-from-soul-package

jgm commented 1 year ago

Note! I don't get the error on my system, using texlive 2023. I think the reason is that the new version of soul incorporates the old soulutf8 (see https://ctan.math.utah.edu/ctan/tex-archive/macros/generic/soul/soul.pdf). So this problem should go away if you upgrade the soul package in your LaTeX setup.

phlummox commented 1 year ago

Hi, thanks for that! For the moment, I actually just rolled back my version of Pandoc to 2.19.2, since that was a quicker fix than upgrading LaTeX. I'm on Ubuntu 20.04, which is supposed to be maintained and updated until 2025, but 2019 is the latest TeX Live version in the Ubuntu 20.04 repositories.

The https://pandoc.org/installing.html page says "We recommend installing TeX Live via your package manager. (On Debian/Ubuntu, apt-get install texlive.)", which is how I installed TeX Live originally. But from what I can tell, looking at https://tug.org/texlive/pkginstall.html, if I want an upgraded version of "soul", I need to install "Native TeX Live" in addition to (or instead of) the version from the Ubuntu repos. I'll see if I can do so, and whether that fixes the problem.

jgm commented 1 year ago

You could probably just put the updated soul.sty in your working directory or local texmf tree.
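A minimal sketch of the local-texmf-tree option, assuming a newer soul.sty has already been obtained. The function name `install_local_sty` and the `tex/latex/<name>/` subdirectory layout are illustrative choices, not something Pandoc prescribes.

```shell
# Hypothetical helper: copy a .sty file into a personal texmf tree and
# print the path it was installed to.
#   $1 = path to the .sty file, $2 = root of the texmf tree
install_local_sty() (
    # The body runs in a subshell, so these variables do not leak out.
    name=$(basename "$1" .sty)
    dest="$2/tex/latex/$name"
    mkdir -p "$dest" && cp "$1" "$dest/" && printf '%s\n' "$dest/$name.sty"
)
```

With TeX Live, `kpsewhich -var-value=TEXMFHOME` prints the personal tree, so a call might look like `install_local_sty soul.sty "$(kpsewhich -var-value=TEXMFHOME)"`. Afterwards, `kpsewhich soul.sty` should report which copy LaTeX will load.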

phlummox commented 1 year ago

Thanks for the suggestion - I'll see if that works, too. But as a longer-term solution, if texlive 2023 is the minimum version of texlive that Pandoc requires, I'd like to ensure I have a reliable way of installing it on Linux distributions which don't have it.

Out of interest, if there are automated tests run on the LaTeX writer, what version of TeX Live do they use? It might be worth updating the documentation to mention them, if Pandoc is only expected to work with those versions.

jgm commented 1 year ago

An alternative to installing a newer soul would be to use a custom template that imports soulutf8 instead of soul.
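A hedged sketch of the custom-template route. It assumes the default LaTeX template loads the package with a line containing exactly \usepackage{soul} (worth checking in the output of `pandoc -D latex` for your version); the function name `use_soulutf8` and the file name `custom.latex` are made up for illustration.

```shell
# Hypothetical filter: read a LaTeX template on stdin and write it to stdout
# with \usepackage{soul} replaced by \usepackage{soulutf8}.
# Lines that merely mention soul as an option are left untouched.
use_soulutf8() {
    sed 's/\\usepackage{soul}/\\usepackage{soulutf8}/'
}
```

Possible usage: `pandoc -D latex | use_soulutf8 > custom.latex`, then `pandoc --template custom.latex -o test.pdf test.md`.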

There are automated tests for the writer, but they just check the LaTeX code it emits; no attempt is made to compile the code using tex. (There are reasons.)