latex3 / babel

The multilingual framework to localize LaTeX, LuaLaTeX and XeLaTeX
https://latex3.github.io/babel/
LaTeX Project Public License v1.3c

Greek as nonmain language AND Roman page numbers AND (makeindex OR xindex) = ☇ #170

Closed ghost closed 1 year ago

ghost commented 2 years ago

Feeding

\documentclass{book}
% \makeatletter
% \let\save@roman\@roman
% \let\save@Roman\@Roman
% \makeatother
\usepackage[greek,ngerman]{babel}
% \makeatletter
% \let\@roman\save@roman
% \let\@Roman\save@Roman
% \makeatother
\usepackage{makeidx}
\makeindex
\begin{document}
\pagenumbering{Roman}
\index{Text}Text
\printindex
\end{document}

as mwe.tex to pdflatex or latex leads to

\indexentry{Text}{{\fontencoding  {OT1}\selectfont  I}}

in the file mwe.idx. As a consequence, both makeindex mwe and xindex mwe produce an empty file mwe.ind. Moreover, xindex mwe fails with the output ...re/texlive/texmf-dist/tex/lualatex/xindex/xindex-lib.lua:524: bad argument #2 to 'format' (number expected, got nil). Used versions:

Given http://tex.stackexchange.com/a/356649 and http://tex.stackexchange.com/a/633522 , a possible workaround would be to uncomment the commented lines or say

\makeatletter
\def\@roman#1{\romannumeral #1}
\def\@Roman#1{\expandafter\@slowromancap\romannumeral#1@}
\makeatother

after calling babel. Another workaround is to say \def\ensureascii#1{#1} right after \begin{document} (cf. http://chat.stackexchange.com/transcript/message/60465101). However, it would probably be much cleaner if babel-greek or babel didn't redefine \ensureascii and/or \@roman and \@Roman for all languages. (If necessary, they could do so only within the scope of Greek-language commands, or only if Greek is the main document language.) The maintainer of babel-greek has been informed.
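For reference, the second workaround applied to the MWE might look like the sketch below. It only rearranges lines already quoted in this report; the assumption is that no Greek passage in the document needs roman numerals, because neutralising \ensureascii removes the protection everywhere.

```latex
\documentclass{book}
\usepackage[greek,ngerman]{babel}
\usepackage{makeidx}
\makeindex
\begin{document}
% Workaround from the report: make \ensureascii a no-op so that the
% page number written to mwe.idx is a plain "I" without font commands.
% Assumption: roman numerals are never typeset inside Greek (LGR) text.
\def\ensureascii#1{#1}
\pagenumbering{Roman}
\index{Text}Text
\printindex
\end{document}
```

With this change the .idx entry is expected to contain only the bare numeral, which is what makeindex and xindex need in order to parse it.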

hvoss49 commented 2 years ago

use lualatex instead of pdflatex and it works. To me it looks like a problem with babel

u-fischer commented 2 years ago

@hvoss49 it is a problem stemming from babel-greek. It redefines \@roman and \@Roman (to ensure that they print roman numbers also when Greek is active), and then they are no longer usable as page numbers in an index entry. Probably the best approach would be to change \index so that it can undo this definition locally.

ghost commented 2 years ago

use lualatex instead of pdflatex and it works. To me it looks like a problem with babel

@hvoss49 This is meant to be a bug report against babel-greek (or babel) specifically with pdflatex/latex. (For lualatex and xelatex I do something else anyway in my larger, non-minimal documents.)

jspitz commented 2 years ago

Similar problem with backref (see also plk/biblatex/issues/1175):

\documentclass[greek,english]{book}
\usepackage{babel}
\usepackage[backref=page]{hyperref}

%Workaround:
%\usepackage{etoolbox}
%\AtBeginDocument{\robustify\ensureascii}

\begin{document}
    \frontmatter
    \cite{article-minimal}
    \mainmatter
    \cite{article-minimal}
    \bibliographystyle{plain}
    \bibliography{xampl}
\end{document}

jbezos commented 2 years ago

@jspitz There is a combination of factors, and as usual an \edef, with an unprotected expansion, is involved (in hyperref), which means \protect’s are just ignored, while \robustify relies on the primitive \protected. Although, IMO, the real culprit is the \edef, I’ll investigate if \ensureascii can be based on \protected instead of on \protect.
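The difference between the two kinds of robustness can be seen in a small test file. This is only an illustrative sketch: \viaProtect, \viaProtected, \resultA and \resultB are hypothetical names that exist in neither babel nor hyperref.

```latex
\documentclass{article}
% LaTeX2e-style robustness: relies on \protect, which is only honoured
% by \protected@edef, \protected@write and friends.
\DeclareRobustCommand\viaProtect[1]{[#1]}
% e-TeX-style robustness: the primitive \protected prefix is honoured
% by every \edef and \write.
\protected\def\viaProtected#1{[#1]}
% A plain \edef, as used in the hyperref code path discussed above.
\edef\resultA{\viaProtect{x}}
\edef\resultB{\viaProtected{x}}
% Write both meanings to the log and terminal for comparison.
\typeout{A: \meaning\resultA}
\typeout{B: \meaning\resultB}
\begin{document}
Check the log.
\end{document}
```

In the log, \resultA should show the macro body already expanded to [x], while \resultB should still contain \viaProtected {x} untouched.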

u-fischer commented 2 years ago

@jbezos making \ensureascii robust will avoid the error but break the links. Hyperref will then try to create links to destinations like page.\ensureascii {ii}:

pdfTeX warning (dest): name{page.\\ensureascii\040{ii}} has been referenced but
 does not exist, replaced by a fixed one

pdfTeX warning (dest): name{page.\\ensureascii\040{i}} has been referenced but 
does not exist, replaced by a fixed one

I'm not sure how to get around the problem, but at the core hyperref assumes that \thepage is expandable and doesn't contain formatting instructions like \fontencoding{OT1}\selectfont.

ghost commented 2 years ago

First, you might wish to check whether, in the case of nonmain Greek, babel(-greek) could redefine what it needs to redefine only within the scope of language-changing commands and environments. (This holds for any language, not just for Greek. E.g., I also have French as a nonmain language and have been getting French-and-caption-related warnings in the log ever since, though the only French text I have in a huge German document is a few French proper names. In the long run, useless warnings are unnerving.) This will at least mitigate this issue and similar ones. Second, if after that you still have a real-world error and need both plain-text and formatted page numbers, you might choose to split \thepage into two versions: one for printing (with whatever formatting commands go with it), another for referencing (plain text, without any commands, including \ensureascii). Or even better, instead of trying to remove the formatting from a formatted page number when needed (which raises issues), you could add formatting to a plain-text page number when needed; adding formatting commands is easier than removing them, after all.
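A user-level sketch of the "add formatting when needed" idea could look as follows. It is an illustration under assumptions, not a proposal for babel itself: \printedpage is a hypothetical name, the kernel definitions of \@roman and \@Roman are copied from the workaround above, and only the plain page style is adapted.

```latex
\documentclass{book}
\usepackage[greek,ngerman]{babel}
\usepackage{makeidx}
\makeindex
\makeatletter
% Keep \thepage plain text: restore the kernel definitions after babel.
\def\@roman#1{\romannumeral #1}
\def\@Roman#1{\expandafter\@slowromancap\romannumeral#1@}
% Hypothetical print-only variant: formatting is added at the point of use.
\newcommand*\printedpage{\ensureascii{\thepage}}
% Same layout as the standard plain style, but printing \printedpage.
\renewcommand*\ps@plain{%
  \let\@mkboth\@gobbletwo
  \let\@oddhead\@empty
  \let\@evenhead\@empty
  \def\@oddfoot{\reset@font\hfil\printedpage\hfil}%
  \let\@evenfoot\@oddfoot}
\makeatother
\pagestyle{plain}
\begin{document}
\pagenumbering{Roman}
\index{Text}Text
\printindex
\end{document}
```

The index file then receives the plain numeral, while the footer still goes through \ensureascii. Other places that print page numbers (table of contents, \pageref inside Greek text, other page styles) are not covered by this sketch.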

jspitz commented 2 years ago

First, you might wish to check whether, in the case of nonmain greek, babel(-greek) could redefine stuff it needs to redefine only within the scope of language-changing commands and environments.

Like this:

--- /usr/local/texlive/current/texmf-dist/tex/generic/babel-greek/greek.ldf
+++ /tmp/meld-tmp_0c8523p.ldf
@@ -104,6 +104,10 @@
     \makeatother
   }{}
 }
+\let\latin@roman\@roman
+\let\latin@Roman\@Roman
+\let\bbl@greek@roman\@roman
+\let\bbl@greek@Roman\@Roman
 \@ifl@aded{def}{lgrenc}{%
   \ProvideTextCommand{\textcopyright}{LGR}{\ensureascii{\textcopyright}}
   \ProvideTextCommand{\textregistered}{LGR}{\ensureascii{\textregistered}}
@@ -113,8 +117,8 @@
   \ProvideTextCommand{\textampersand}{LGR}{\ensureascii{\ltx@amp}}
   \DeclareRobustCommand{\&}{\ifmmode\ltx@amp\else\textampersand\fi}
   \ProvideTextCommand{\SS}{LGR}{\ensureascii{\SS}}
-  \def\@roman#1{\expandafter\ensureascii\expandafter{\romannumeral#1}}
-  \def\@Roman#1{\expandafter\ensureascii\expandafter{%
+  \def\bbl@greek@roman#1{\expandafter\ensureascii\expandafter{\romannumeral#1}}
+  \def\bbl@greek@Roman#1{\expandafter\ensureascii\expandafter{%
                 \expandafter\@slowromancap\romannumeral#1@}}
   \DeclareRobustCommand{\greektext}{%
     \fontencoding{LGR}\selectfont
@@ -486,6 +490,14 @@
   \DeclareTextCompositeCommand{\`}{LGR}{^^9f}{\LGR@hiatus}
   \addto\extraspolutonikogreek{\languageshorthands{greek}}%
   \declare@shorthand{greek}{~}{\greek@tilde}
+  \addto\extrasgreek{%
+    \let\@roman\bbl@greek@roman
+    \let\@Roman\bbl@greek@Roman
+  }
+  \addto\noextrasgreek{%
+    \let\@roman\latin@roman
+    \let\@Roman\latin@Roman
+  }
 }{} % End of LGR-specific code.
 \providecommand*{\anwtonos}{\textdexiakeraia}
 \providecommand*{\katwtonos}{\textaristerikeraia}

This won't help if Greek is the active language, so a fix at the core is still needed. But I think it should be done anyway.

u-fischer commented 2 years ago

babel(-greek) could redefine stuff it needs to redefine only within the scope of language-changing commands and environments.

Well that wouldn't help in the case of page numbers as the language scope in which such a page number is stored (with \label etc) can be different from the scope in which it is used (via \ref etc).

you might perhaps choose to split \thepage in two versions:

That plan is better: one should store the number (e.g.) and the intended formatting (e.g. \roman). But the problem here is again with labels and references: even with hyperref, which extends the label system, there is not enough room to carry both pieces of information around.

ghost commented 2 years ago

Well that wouldn't help in the case of page numbers as the language scope in which such a page number is stored (with \label etc) can be different from the scope in which it is used (via \ref etc).

Perhaps, one could think of attempting to store some of the necessary local data, such as the language or the formatting, together with or alongside the stored label and to use this data at the point of the usage of the label.

u-fischer commented 2 years ago

Well, you can try with zref; it allows you to store more data. But it will not be trivial to correctly format all locations where page numbers are used, such as indexes and bibliographies. IMHO the best option is to drop the LGR encoding and \ensureascii by using a Unicode engine.

ghost commented 2 years ago

To the systems architect in me, storing more data seems to border on a small architectural change, which seems to require more work than a quick-and-dirty hack. (Architectural changes and cleanups for most software projects are inevitably required; if their code evolves in small quick-and-dirty steps, it usually becomes unmanageable ad-hoc spaghetti. I view switching to [Xe|Lua]LaTeX as another, more profound architectural change. Still, [pdf]latex is alive and does not have a trivial replacement everywhere yet: I personally dealt with svmono and arxiv.org.)

jspitz commented 2 years ago

babel(-greek) could redefine stuff it needs to redefine only within the scope of language-changing commands and environments.

Well that wouldn't help in the case of page numbers as the language scope in which such a page number is stored (with \label etc) can be different from the scope in which it is used (via \ref etc).

True, but it helps in all cases where Greek is not active (but loaded), e.g. the MWE in https://github.com/latex3/babel/issues/170#issuecomment-1229431476. In any case, I don't see why babel-greek should globally redefine \@roman once and for all.

u-fischer commented 2 years ago

True, but it helps in all cases where Greek is not active (but loaded),

If you don't have references to roman numbers in greek parts of your document, you can simply reinstate the default LaTeX definitions everywhere and be done with it. But if you have such references then they will error or give faulty links or faulty output with your solution. So what do you gain?

jspitz commented 2 years ago

@u-fischer you are right, I forgot about \pageref to non-Greek roman pages within Greek context. So I retract my proposal.

jspitz commented 2 years ago

@PeterMuellerr FYI that's this case:

\documentclass[greek,english]{book}
\usepackage{babel}

\begin{document}
    \frontmatter
    a\label{x}
    \clearpage
    \selectlanguage{greek}Page \pageref{x}
\end{document}

This would falsely come out as ι (rather than i) if the \@roman redefinition were restricted to the Greek language context.

ghost commented 2 years ago

@jspitz

\selectlanguage{greek}Page \pageref{x}

Thanks! As of now, it comes out as “Παγε i”. I believe you concerning “ι” if you have tested this. I apologize for having forgotten that there is persistent, stored stuff that has to be dealt with, too. Anyway, is this a realistic example? Wouldn't you write, perhaps, \selectlanguage{greek}Σελίδα \selectlanguage{english}\pageref{x} instead of \selectlanguage{greek}Page \pageref{x}? If a user switches the language for the main text himself/herself, it could be argued that he/she should switch or consider switching languages for the references, too, since, in high-level terms, “i” is not Greek-language text. Of course, I know the user should better be relieved of switching languages himself/herself.
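Spelled out as a complete file, the variant with an explicit language switch around the reference might look like this. It is a sketch that uses \foreignlanguage instead of a second \selectlanguage so that the switch stays local to the reference.

```latex
\documentclass[greek,english]{book}
\usepackage{babel}

\begin{document}
    \frontmatter
    a\label{x}
    \clearpage
    \selectlanguage{greek}
    % The roman page number is not Greek text, so it is set explicitly
    % in the language (and font encoding) it was created in.
    Σελίδα \foreignlanguage{english}{\pageref{x}}
\end{document}
```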

jbezos commented 2 years ago

making \ensureascii robust will avoid the error but break the links. Hyperref will then try to create links to destinations like page.\ensureascii {ii}:

@u-fischer You’ve closed my investigations before I started 🙂, but I was expecting something like that. Of course, the root of the problem is assuming \thepage is fully expandable. See, for example:

But, I agree \roman must not be redefined globally when Greek isn’t the main language, and the current maintainer of babel-greek has been informed. Maybe it's time to insist.

Second, if after that you still have a real-world error and need both plain-text and formatted page numbers, you might perhaps choose to split \thepage in two versions: one for printing (with any kinds of formatting commands that go with it), another for referencing (plain text, without any kinds of commands including \ensureascii).

@PeterMuellerr This would be the ideal solution, even if, as pointed out by Ulrike, it’s not trivial.

u-fischer commented 2 years ago

You’ve closed my investigations before I started

@jbezos Well, I think some investigation in this area is needed, not only for roman/greek. The backref example also fails for Spanish, and a number of packages also redefine \@arabic, which can lead to problems too.

Of course, the root of the problem is assuming \thepage is fully expandable

The root of the problem is that page numbers are used in many places by various tools with differing requirements: makeindex wants to sort them, biblatex wants to compress page references to ranges, hyperref wants to create destinations, links and page labels, the label/ref system wants to move them through the aux file, and the document wants to print them in various formats depending on the current language and the place where they are printed. All this works fine if page numbers are expandable and expand to something simple, but it gets quite difficult if language-dependent formatting is added.

jbezos commented 2 years ago

@u-fischer Or fancy, but valid, formats like 3▪4 (where 3 is the chapter and 4 the page in the chapter, and ▪ is your favorite bullet in your favorite font, or even an image).

u-fischer commented 2 years ago

@jbezos Yes. If you want to investigate, here is an example. As you can see, the main problem is the index and not so much hyperref. The example ignores the problem of multiple languages. Also, the XXXX- in the definition of \blub is expandable, so the unprotected version works in the index here (but breaks page links); in real-world examples it would contain, e.g., font selection commands, which can expand in the index and so would break there too.

If someone could come up with a good idea how to handle the index I could add support in hyperref - but I won't add a variety of commands like \ensureascii etc, it then should be one common command (or perhaps one for each numbering style) and the language files would have to coordinate their access to such a formatting command.

\documentclass[]{book}

\usepackage{index,etoolbox}
\makeindex
\usepackage[backref=page]{hyperref}

\makeatletter
\def\@roman#1{\expandafter\blub\expandafter{\romannumeral#1}}

% \newcommand\blub[1]{XXXX-#1} % not protected: works more or less in the index, but breaks links
% \DeclareRobustCommand\blub[1]{XXXX-#1} % robust
\protected\def\blub#1{XXXX-#1} % protected: index entries go missing

\pdfstringdefDisableCommands{\let\blub\@firstofone}
\patchcmd\hyper@link@{\edef\Hy@tempb{#3}}{\let\blub\@firstofone\edef\Hy@tempb{#3}}{}{\fail}

\begin{document}
\frontmatter 
    first page 
    \index{orange}
 \newpage 
    \cite{article-minimal}
    second page \phantomsection\label{abc}
\mainmatter
    mainmatter
    \index{duck}\index{orange}
    \pageref{abc}
    \cite{article-minimal}
    \bibliographystyle{plain}
    \bibliography{xampl}
    \printindex
\end{document}

ghost commented 2 years ago

@u-fischer Concerning your proposal of using zref, do you think of something similar to the code below? It would allow us, IMHO, to store \languagename, \the\c@page, and formatting separately.

A proof of concept; too simple for real life:

\documentclass{book}
\usepackage{zref}
\usepackage[greek,english]{babel}
\makeatletter
\zref@newlist{pageWithLang}
\zref@newprop*{lang}[english]{\languagename}
\zref@addprops{pageWithLang}{page,lang}
\newcommand{\labelWithLang}[1]{%
  \zref@setcurrent{page}{\thepage}%% Or, say, \romannumeral\the\c@page , if you know that the last command changing pagenumbering set page numbers to roman.
  \zref@setcurrent{lang}{\languagename}%% This is a simplification.  Frankly speaking, I don't know how to get the language with which the current page number has been created. Usually any Latin-based language would do, but page numbers can also be Hebrew, for example .
  \zref@labelbylist{#1}{pageWithLang}%
}
\newcommand{\pagerefWithLang}[1]{%
  \foreignlanguage{%
    \zref@extract{#1}{lang}%
  }{%
    \zref@extract{#1}{page}%
  }%
}
\makeatother
\begin{document}
    \frontmatter
    \labelWithLang{englishPageLabel}English page\\
    Pages \pagerefWithLang{englishPageLabel} and \pagerefWithLang{greekPageLabel}.
    \clearpage
    \selectlanguage{greek}
    \labelWithLang{greekPageLabel}Ελληνική σελίδα\\
    Σελίδες \pagerefWithLang{englishPageLabel} και \pagerefWithLang{greekPageLabel}.
\end{document}

@jspitz Would such code (perhaps after some changes) fit with your proposal of local-only redefinitions in greek.ldf?

jspitz commented 2 years ago

I am not sure I understand the plan. I don't think babel wants to load zref. Maybe in the long term, as LaTeX has already started to include some of zref's concepts (see \@currentcounter), the LaTeX kernel could provide a way to separate page number formatting from the actual page number.

ghost commented 2 years ago

I am not sure I understand the plan.

I thought that one of Ulrike's earlier suggestions was to use zref. The way I understood this, this would provide us with an opportunity to store the plain-text page number separately from language and formatting (given enough effort, no doubt about that). This would allow us to get rid of global redefinitions of stuff by .ldf similar to what you tried out. (As for what babel wants or doesn't want to do, it's probably not up to me to comment on that or to suggest that anyone does anything; as of now, I wouldn't be able to execute any changes in babel in general or greek.ldf in particular anyway.)

jspitz commented 2 years ago

The approach helps to get a proper page reference also with the change to greek.ldf I proposed. But it's more a user workaround I think than a fix of the problem at the core.

ghost commented 2 years ago

The approach helps to get a proper page reference also with the change to greek.ldf I proposed. But it's more a user workaround I think than a fix of the problem at the core.

Yes because it seems that for the fix (in this specific way), we would have to change \label, \ref, and \pageref rather than to introduce \labelWithLang, \pagerefWithLang, … .

gmilde commented 1 year ago

The problem is more about font encoding and script, and only indirectly about Babel (because Babel-Greek has to ensure that the Greek script is supported).

The core of the problem is that \roman and \Roman expect the active font encoding to be a "standard text font encoding" but LGR is non-standard :( Solving this at the core would require a) support for T7 (standard Greek text encoding, currently not defined), or b) \roman and \Roman as NFSS "TextCommand"s (similar to \copyright).

For a), we would need agreement on a character table, font encoding definition files and a set of re-encoded fonts. Work on T7 stalled when the Greek TeX community decided that Unicode was better suited for typesetting Greek. However, for monotonic Greek on 8-bit TeX it would still be a vast improvement over LGR.

For b), we would need support for NFSS TextCommands in the places where \roman and \Roman are used.

Any change to "greek.ldf" should be checked for adverse side-effects. E.g., in Greek documents, roman numbering is used for nested enumerated lists. If there is an agreement on the best way forward, I am more than happy to implement it in either "greek-fontenc" or "babel-greek".

For non-Greek documents with the occasional Greek symbol or term, babel-greek is overkill. Using "textalpha" or "alphabeta" instead should solve the indexing and backref problems:

\documentclass{book}

\usepackage{textalpha}
\usepackage[ngerman]{babel}

\usepackage{makeidx}
\makeindex
\begin{document}
\pagenumbering{Roman}

\index{Text}Text

Some text using Greek script: \ensuregreek{λογος}.

Roman numbering is left untouched and fails with Greek with an 8-bit engine:

% abuse some existing counters for a quick test:
\setcounter{enumi}{5}
\setcounter{enumii}{3}
\setcounter{enumiii}{9}

item \Roman{enumi}.\roman{enumii}.\roman{enumiii} vs.
\ensuregreek{αντικείμενο \Roman{enumi}.\roman{enumii}.\roman{enumiii}}

\printindex
\end{document}

jbezos commented 1 year ago

Sure, the LGR is problematic, but the point here is \roman and \Roman are modified for all languages, while changes should be local or, at least, ‘global’ solely when Greek is the main language, so that \thepage, which is also ‘global’, prints the correct numeral. A second issue is \makeindex understands text and nothing else (related issue: https://github.com/latex3/babel/issues/26). On the other hand, I think not supporting greek as a secondary language is not a real solution.

I was working on a new feature (or, rather, on improving an existing one), which will allow one to write something like this:

\documentclass{article}
\usepackage[LGR, T1]{fontenc}
\usepackage[english]{babel}
\begin{document}
English \foreignlanguage{greek}{Ελληνικά} English.
\end{document}

(It’s based on https://latex3.github.io/babel/guides/locale-arabic.html#pdftex.)

gmilde commented 1 year ago

Am 8.12.22 schrieb Javier Bezos:

... the point here is \roman and \Roman are modified for all languages, while changes should be local or, at least, ‘global’ solely when Greek is the main language

I would prefer local-only changes, too. However, this has the potential to silently break existing documents. (While an Iota for number 1 may be only a style problem, the V for number 5 becomes a no-break space!)
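The effect can be checked with a few lines. This is a minimal sketch that types the numerals as raw letters, so it does not depend on how \@roman and \@Roman are currently defined:

```latex
\documentclass{article}
\usepackage[greek,english]{babel}
\begin{document}
Latin context: I V

\selectlanguage{greek}
% Raw Latin letters are interpreted in the LGR font encoding here.
Unprotected: I V

% \ensureascii switches to a Latin font encoding for its argument.
Protected: \ensureascii{I V}
\end{document}
```

Comparing the second and third lines of the output shows which letters survive in LGR and which do not.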

An example where the LGR-proof Roman numerals are required also with Greek as secondary language:

\documentclass[a4paper,oneside]{book}

% Save original definition
\makeatletter
\let\bbl@greek@save@roman\@roman
\let\bbl@greek@save@Roman\@Roman
\makeatother

\usepackage[greek,english]{babel}

\makeatletter

% Restore original definition
\let\@roman\bbl@greek@save@roman
\let\@Roman\bbl@greek@save@Roman

% Make Roman numerals LGR-proof only if Greek is the active language:

% LGR-proof Roman numerals
\def\lgr@proof@roman#1{\expandafter\ensureascii\expandafter{\romannumeral#1}}
\def\lgr@proof@Roman#1{\expandafter\ensureascii\expandafter{%
                       \expandafter\@slowromancap\romannumeral#1@}}
% Switch between original and LGR-proofed version              
\addto\extrasgreek{%
  \let\@roman\lgr@proof@roman
  \let\@Roman\lgr@proof@Roman
}
\addto\noextrasgreek{%
  \let\@roman\bbl@greek@save@roman
  \let\@Roman\bbl@greek@save@Roman
}

\makeatother

\begin{document}
\frontmatter

\tableofcontents  % Check for Iota in Roman page number!

\chapter{English Preface \label{ch:preface}}
Use case:
a document with Greek chapter in the ``frontmatter'' and a ToC.

\selectlanguage{greek}
\chapter{Greek Preface \label{ch:preface-greek}}
logos
\selectlanguage{english}

\mainmatter

\chapter{First Chapter \label{ch:1}}
The English ``preface'' is at page \pageref{ch:preface}.
The Greek ``preface'' is at page \pageref{ch:preface-greek}.

\selectlanguage{greek}
The English ``preface'' is at page \pageref{ch:preface}.
The Greek ``preface'' is at page \pageref{ch:preface-greek}.

\end{document}

A second issue is \makeindex understands text and nothing else (related issue: #26).

It seems "makeindex" can handle TeX macros that have a replacement in *.ist files, like

% save printable macros
merge_rule  "\\TeX"     "TeX"

Maybe we can fix "makeindex" ensuring \ensureascii ends up in the *.idx file and add a merge_rule that removes it for the index generation?

jbezos commented 1 year ago

👌 Good example. I’ll study it, but it seems this issue is going to become (another) ‘known issue’ of the LGR encoding.

gmilde commented 1 year ago

Maybe we need a new language option "global-lgr-fixes=[on|off]" or so. After a transition period, the default could become "off".

jbezos commented 1 year ago

I’m closing this issue for two reasons. (1) It’s an intrinsic limitation of the non-standard LGR encoding, which is not really part of the babel core. (2) There is now (3.84) a simple alternative to set more or less short Greek texts as a secondary language (see What’s new in babel 3.84).

gmilde commented 1 year ago

Thank you for providing another workaround (for small Greek text parts) in Babel 3.84. (There is a small but confusing documentation error in What’s new in babel 3.84: fontspec -> fontenc; you cannot set a font encoding with fontspec.)

This adds one level of language support (hyphenation), but would not help in documents requiring translated auto-strings (e.g. for a Greek abstract). I'd like to see more testing and better documentation.

I prepared a new version of the contributed babel-greek package and opened an issue there: https://codeberg.org/milde/greek-tex/issues/1.

gmilde commented 1 year ago

@u-fischer @jspitz

... I could add support in hyperref - but I won't add a variety of commands like \ensureascii etc, it then should be one common command (or perhaps one for each numbering style) and the language files would have to coordinate their access to such a formatting command.

babel-greek tries to solve this with a new "TextCommand" (see commit 0f56b).

  \ProvideTextCommandDefault{\EnsureStandardFontEncoding}{\@firstofone}
  \ProvideTextCommand{\EnsureStandardFontEncoding}{LGR}[1]{%
                                                      \ensureascii{#1}}
  \AtBeginDocument{\@ifpackageloaded{hyperref}
                     {\pdfstringdefDisableCommands{%
                         \let\EnsureStandardFontEncoding\@firstofone}}
                     {}}

This seems to fix the "backref" issues in my tests. Is there anything missing regarding hyperref?