jgm / pandoc

Universal markup converter
https://pandoc.org

Babel's `shorthand` option makes some characters to be skipped #6817

Open lygamac opened 3 years ago

lygamac commented 3 years ago

When setting lang:es in the header, the . decimal separator is not rendered at all (it's supposed to be transformed to a comma). French still renders the decimal separator though.


pandoc version: pandoc.exe 2.11.1 Compiled with pandoc-types 1.22, texmath 0.12.0.3, skylighting 0.10.0.3, citeproc 0.1.0.3, ipynb 0.1

jgm commented 3 years ago

Figure placement is done by LaTeX -- I'm not sure there's anything we can do about that in pandoc.

Second, when rendered from markdown directly to pdf, the column width of the longtable seems to be the length of the longest string, including spaces:

See the manual for information about how relative column widths are computed from the markdown source. (It depends which kind of markdown table you are using, but this is all documented. Looks like you have a pipe table, so you can adjust these widths by changing the widths of the lines under the headers.)

And last, when setting lang:es in the header, the . decimal separator is not rendered at all (it's supposed to be transformed to a comma). French still renders the decimal separator though.

I don't know why this is happening, but it seems to be something LaTeX is doing. Pandoc is just passing through the math verbatim. I'd ask about this on a LaTeX forum, using a simple pure tex example.

lygamac commented 3 years ago

I see, the problem seems to be the default latex template. I'll try to modify that to suit my use.

I found the comma problem. From the default latex template:

\ifxetex
  % Load polyglossia as late as possible: uses bidi with RTL langages (e.g. Hebrew, Arabic)
  \usepackage{polyglossia}
  \setmainlanguage[]{spanish}
\else
  \usepackage[shorthands=off,main=spanish]{babel}
\fi

The decimal separator is not rendered because of \usepackage[shorthands=off,main=spanish]{babel} when using pdflatex. Without shorthands=off the decimal separator is rendered as a comma; in the XeLaTeX branch above it is rendered as a point.

(The decimal separator is not rendered with shorthands=on either. I don't know why, but you have to remove the option entirely for it to work.)
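A minimal pure-LaTeX test file for this report might look like the sketch below. It assumes only the babel line quoted from the template; the sample sentence and the number are made up for illustration. Going by the observations above, compiling it with pdflatex is where the decimal point goes missing. No expected output is listed because it depends on the installed babel version.

```latex
\documentclass{article}
% Same babel options as the default template line quoted above.
\usepackage[shorthands=off,main=spanish]{babel}
\begin{document}
% Illustrative content: a decimal number in math mode.
El valor es $3.14$.
\end{document}
```

Removing shorthands=off from the option list gives the comparison case described above.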

It seems that when the dot . is a shorthand character (Galician and Spanish, according to babel's documentation), the decimal separator is dropped. @jgm, if you don't mind, could you add something like this to the default template?

\ifxetex
  % Load polyglossia as late as possible: uses bidi with RTL langages (e.g. Hebrew, Arabic)
  \usepackage{polyglossia}
  \setmainlanguage[]{lang}
\else
  \ifdotshorthand
    \usepackage[main=lang]{babel}
  \else
    \usepackage[shorthands=off,main=lang]{babel}
  \fi
\fi

You had a typo in the comment btw! ;)

lygamac commented 3 years ago

For the figure float, I figured out the problem.

In the default template the figure float placement is defined as htbp, where p stands for a special page containing only floats. Specifying htb! instead, so that special pages are not used, fixed the float problem.

jgm commented 3 years ago

Thanks for this -- I'll reopen so we can consider possible changes to the default latex template.

  1. consider adding the conditional on \ifdotshorthand
  2. consider changing the default float position to htb!.

I'm not sure that I understand yet the implications of either.

What exactly is a "shorthand," anyway?

And what would be the effect of htb!? Does it mean that special pages cannot be used, even if the alternative is putting the figure on a page with, say, one line? I'd welcome feedback from texperts about what is best here.

lygamac commented 3 years ago

Although I'm not a latex professional, as far as I understood, in short:


In more detail, a shorthand is defined as:

A shorthand is a sequence of one or two characters that expands to arbitrary TeX code.

In the case of the babel package, shorthands seem to be used for localization, to minimize the changes needed in the tex file. The Spanish and Galician packages redefine the . character as TeX code so that in math mode it is replaced by a comma while in text mode it is still a dot. The formal convention in these languages is to write the decimal separator as a comma instead of a dot.

(If there's no need to remove the shorthands defined by babel, I'll remove that part from the code.)

htb! forces LaTeX to choose a float position among h (here, where the float appears in the LaTeX code), t (top of the page), and b (bottom of the page). Without !, LaTeX will still put large figures on a special page when it finds that suits best.
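For reference, this is roughly what an explicit placement specifier looks like on a single figure. It is a sketch, not what the template emits: the black \rule box and the caption text are placeholders. As far as I know, the ! relaxes LaTeX's limits on how many floats and how much float area a page may hold, while leaving out p is what rules out float pages; a float that fits none of h, t, or b is held back, together with later floats of the same type, until \clearpage or the end of the document.

```latex
\documentclass{article}
\begin{document}
Some text before the figure.

% [!htb]: allow "here", top, and bottom; "!" relaxes the usual
% float-placement limits. No "p", so no float-only page is allowed.
\begin{figure}[!htb]
  \centering
  \rule{6cm}{4cm} % black box standing in for an image
  \caption{Placeholder figure.}
\end{figure}

Some text after the figure.
\end{document}
```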

In any case, \floatpagefraction can always be redefined in a file passed with --include-in-header.

EDIT: For \floatpagefraction to work properly, the float placement has to be forced: htbp!. With \renewcommand{\floatpagefraction}{0.8}, a figure only gets a special page without text when it occupies more than 80% of the page.
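A header file along these lines could carry both settings. This is only a sketch: the file name float-settings.tex is hypothetical, and overriding \fps@figure assumes that the default template sets the figure placement through that macro before the header includes are read, which is worth checking against the template version in use.

```latex
% float-settings.tex (hypothetical file name)
% A float page is only used when floats fill at least 80% of it.
\renewcommand{\floatpagefraction}{0.8}

% Default placement for figures that give no specifier of their own.
% Assumption: the template defines \fps@figure earlier in the preamble.
\makeatletter
\def\fps@figure{!htbp}
\makeatother
```

It would be passed to pandoc with something like `pandoc input.md -o output.pdf --include-in-header=float-settings.tex`, where the input and output names are placeholders.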

jgm commented 3 years ago

I'm really confused about this. My understanding is that shorthands=off should eliminate all the "shorthands"...but then, . should remain as ., not be removed or ignored. Can anyone explain this?

I like the idea of adding

\renewcommand{\floatpagefraction}{0.8}

and changing the htbp to htbp!. Can anyone think of drawbacks? @tarleb @mb21 @adunning

lygamac commented 3 years ago

According to this, even with shorthands=off the babel package sets some parameter for the shorthand characters.

It's considered a babel bug in the community: latex3/babel/issues/38. However, since Galician babel showed the same problem when I tested it earlier, my best guess is that it was done on purpose and is not a bug.

Although passing shorthands=off does not work properly, redefining the shorthand command does. Adding:

\let\LanguageShortHands\languageshorthands
\def\languageshorthands#1{}

after the babel package is loaded would disable all the shorthands, and the decimal separator is then rendered correctly in both Spanish and Galician (. remains as .).
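Put into a complete test file, the workaround would sit directly after the babel line. The sketch below reuses the two lines from this comment unchanged; the document body is made up for illustration, and whether the dot survives is as reported here, not something verified independently.

```latex
\documentclass{article}
\usepackage[shorthands=off,main=spanish]{babel}
% Keep the original command under another name, then make
% \languageshorthands a no-op so no language shorthands are activated.
\let\LanguageShortHands\languageshorthands
\def\languageshorthands#1{}
\begin{document}
% Illustrative content: a decimal number in math mode.
El valor es $3.14$.
\end{document}
```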

lygamac commented 3 years ago

Another thing related to the default template: the page number on the title page is always centered. The configured page number position only takes effect from the second page onward.

For my own documents I disabled page numbering on the title page, so that numbers start to display and count after the table of contents. But some people start the document body immediately after the title, so my style won't suit them.

As a result, it might be a good idea to add something like this before \maketitle:

\makeatletter
    \@ifpackageloaded{fancyhdr}{
    \fancypagestyle{plain}{}
    }
\makeatother

So the page number position is always defined by the user in the header files.

jgm commented 3 years ago

The post you linked to recommends simply using the es-nodecimaldot option. Did you try that? Maybe we should use shorthands=off,es-nodecimaldot. That seems simpler than redefining commands.
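For concreteness, the option combination suggested here would look like this in a test file. It is a sketch only: es-nodecimaldot is an option of the Spanish babel style, so it would presumably have to be added only when Spanish is the language, and the document body is made up for illustration.

```latex
\documentclass{article}
% es-nodecimaldot asks the Spanish style to leave the decimal dot alone.
\usepackage[shorthands=off,es-nodecimaldot,main=spanish]{babel}
\begin{document}
% Illustrative content: a decimal number in math mode.
El valor es $3.14$.
\end{document}
```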

As for the first page number, why don't you open a separate issue for that?

lygamac commented 3 years ago

es-nodecimaldot is a solution only for Spanish babel. I'm afraid that there are more characters and languages (for example Galician) that are not rendered correctly for the same reason.

Redefining the command achieves the expected behavior for all languages.

why don't you open a separate issue for that?

On my way

jgm commented 3 years ago

I'm hesitant to add this kind of low-level workaround to the default template, if it's really a bug in babel. Maybe it's the thing to do, though.

Query: are you sure this is caused by babel and not by the additional content pandoc inserts in the template slot babel-newcommands? Looking at the source of the LaTeX writer, we have

        $ defField "babel-newcommands" (vcat $
           map (\(poly, babel) -> literal $
            -- \textspanish and \textgalician are already used by babel
            -- save them as \oritext... and let babel use that
            if poly `elem` ["spanish", "galician"]
               then "\\let\\oritext" <> poly <> "\\text" <> poly <> "\n" <>
                    "\\AddBabelHook{" <> poly <> "}{beforeextras}" <>
                      "{\\renewcommand{\\text" <> poly <> "}{\\oritext"
                      <> poly <> "}}\n" <>
                    "\\AddBabelHook{" <> poly <> "}{afterextras}" <>
                      "{\\renewcommand{\\text" <> poly <> "}[2][]{\\foreignlanguage{"
                      <> poly <> "}{##2}}}"
               else (if poly == "latin" -- see #4161
                        then "\\providecommand{\\textlatin}{}\n\\renewcommand"
                        else "\\newcommand") <> "{\\text" <> poly <>
                    "}[2][]{\\foreignlanguage{" <> babel <> "}{#2}}\n" <>
                    "\\newenvironment{" <> poly <>
                    "}[2][]{\\begin{otherlanguage}{" <>
                    babel <> "}}{\\end{otherlanguage}}"

This affects precisely spanish and galician -- might it be interfering with babel's shorthands=off setting somehow?
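To make the excerpt easier to follow, here is the LaTeX it would emit, obtained by substituting into the quoted strings by hand. The first group takes poly = "spanish"; the second takes a (poly, babel) pair of ("french", "french"), which is an assumed example value and not taken from the source. This is a reading of the excerpt as quoted, not captured pandoc output.

```latex
% Branch for spanish (and, analogously, galician):
\let\oritextspanish\textspanish
\AddBabelHook{spanish}{beforeextras}{\renewcommand{\textspanish}{\oritextspanish}}
\AddBabelHook{spanish}{afterextras}{\renewcommand{\textspanish}[2][]{\foreignlanguage{spanish}{##2}}}

% Branch for other languages, shown for the assumed pair ("french", "french"):
\newcommand{\textfrench}[2][]{\foreignlanguage{french}{#2}}
\newenvironment{french}[2][]{\begin{otherlanguage}{french}}{\end{otherlanguage}}
```

In the excerpt as quoted, only the second branch defines an environment named after the language; the first branch only saves and redefines the \text... command.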

Looks like it was added by @mb21 in 9328f4cd3d5d5b96e7783b419214bd8599c17ebc? Maybe he can comment.

jgm commented 3 years ago

Oddly, I don't see this code appearing in latex results, with either pdflatex or xelatex as the pdf-engine. [EDIT: it appears that's because the list to which this map is applied is empty. Apparently this is just a list of languages that are used in the document, other than the main language. @mb21 is that as intended? I'd like to understand what's going on here a bit better. Note that if you add a fenced div to the document with {lang=es}, then the list is nonempty and this code gets added. However, and this I think is a separate bug, doing this causes an error with pdflatex or lualatex: "Environment spanish undefined."]

Reminder to self: as noted above, the . disappears with lang=es and --pdf-engine=pdflatex or lualatex (babel), but it is retained with xelatex (polyglossia).

jgm commented 3 years ago

It looks as if this code is designed to avoid a conflict between babel and polyglossia, but the conflict shouldn't arise, since we use one or the other, right? Or perhaps the worry is that some documentclasses might automatically load babel?

lygamac commented 3 years ago

are you sure this is caused by babel and not by the additional content pandoc inserts in the template slot babel-newcommands?

Pretty sure, I have tried with a blank tex file where only babel is included.

jgm commented 3 years ago

@mb21 did you see the query above? I'm wondering if we should remove some of this code.

mb21 commented 3 years ago

Thanks for the ping, didn't see it before...

so the reasoning/discussion for:

            -- \textspanish and \textgalician are already used by babel
            -- save them as \oritext... and let babel use that
            if poly `elem` ["spanish", "galician"]

is https://github.com/jgm/pandoc/issues/895#issuecomment-148351985 (and subsequent comments). Maybe we don't need to support the babel in TeX Live 2015 and below anymore?

Yes, the otherlangs variable is an array of the languages that the document uses in spans and divs; it does not include the document language. From the manual:

Use native pandoc Divs and Spans with the lang attribute to switch the language:

jgm commented 3 years ago

Maybe we don't need to support the babel in TeX Live 2015 and below anymore?

I don't think so.

mb21 commented 3 years ago

Ah, I misread the comments I linked to. From https://github.com/jgm/pandoc/issues/895#issuecomment-148531286

I’m afraid \textspanish and \textgalician are still in texlive 2015 (just not in the babel manual): in /usr/local/texlive/2015/texmf-dist/tex/generic/babel-spanish/spanish.ldf and /usr/local/texlive/2015/texmf-dist/tex/generic/babel-galician/galician.ldf

And the 2017 version I've installed has it as well in:

/usr/local/texlive/2017basic/texmf-dist/tex/generic/babel-spanish/spanish.ldf

And seems even the newest version has them:

So it seems we need to keep that hack for the moment... If we have a problem with those lines, we could ask the person who gave me that tip over at https://tex.stackexchange.com/questions/273512/renewcommand-textspanish ?

jgm commented 3 years ago

Apparently this bug has now been fixed in babel.

After a suitable delay, I'd like to remove the hackish code we currently include in the template to disable shorthands. So, reopening this to track it.