robustifying \ensureascii and adding \asciiensure

Udi-Fogiel commented 9 months ago

I thought it will be useful to have \asciiensure (similar to \latintext). I also think \ensureascii should be robust as it is not expandable (maybe some variant of \protected will be preferable? I'm not really sure what is the best way to protect macros in LaTeX these days, but \DeclareRobustCommand is how \latintext is defined...)

BTW, why is \latintext considered deprecated? just because it does not consider all font encodings or is there anything more fundamental?

Udi-Fogiel commented 9 months ago

See this conversation for a use of \asciiensure (specifically the last two comments).

jbezos commented 9 months ago

\ensureascii is unprotected on purpose. It’s either ignored, so there is no real reason to protect it, or it’s defined with a couple of protected macros. The last definition is activated only when fontenc is loaded with a non-LICR/non-ASCII encoding, most notably LCR (the only in real use, I think).

When \latintext and the like were introduced very long ago, babel was still essentially a package to select the language for a monolingual document. Pairs like \cyrillictext / \latintext made some sense, but not now.

The original explanation for this set of macros was:

When text is being typeset in an encoding other than ‘latin’ (OT1 or T1), it would be nice to still have Roman numerals come out in the Latin encoding.

Which clearly shows how accessory and linked to encodings like LGR and OT2 it was.

\latintext was deprecated for a few reasons:

It failed often.
It could switch the font even if unnecessary (most encodings include the ASCII range).
The script is not enough, because you also need know the language (eg, for hyphenation).

I don’t fully understand the last two comments in the liked discussion. Can you provide an example?

Udi-Fogiel commented 9 months ago

The main issue Günter, the maintainer of babel-greek, and I are facing is how to set the correct font encoding when switching from hebrew/greek to another language. The example file I posted at the start of the linked ticked produce the following code (process with pdfTeX):

\documentclass[english,greek]{article}
\usepackage[T1,LGR]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\begin{document}
Hello

\selectlanguage{english}
Hello

\end{document}

As you can see, the font encoding did not switch to an ascii compatible one, so the output is wrong. There are several ways to solve it. I suggested Günter to use \latintext as I've noticed arabi is using it, but he said that \latintext is considered deprecated, so I created the pull request.

A second option would be to require from the users to load encodings such as LGR, as a secondary encoding only and declare another encoding which is ascii compatible as the main one (which is what babel-hebrew essentially doing, but I don't like it).

As a third option, we can require from all the .ldf maintainers to make sure that the correct encoding is used when switching to their language, like all the non-latin languages are doing. I think this is the best solution, but currently non of the latin languages are doing anything related to it, and I don't know if all the languages are still actively maintained.

And lastly we can drop all support from .ldf files with regard to font encodings, and put this responsibility on the user to add the encodings he would like via \extras

Just in case it will be useful, here is how some of the non-latin languages are currently dealing with this problem

Hebrew:

\if@rl%
     \let\encodingdefault=\lr@encodingdefault%
  \fi%
  \fontencoding{\encodingdefault}%
  \selectfont%
  \@rlfalse

which is not really a good solution (I'm not sure why it is a part of rlbabel.def and not hebrew.ldf and why it involves \if@rl).

Arabic:

\addto\noextrasarabic{%
\@rlfalse
\@arabicfalse
\latintext\normalfont %enough ??
% Restore the lplain.tex penalties??
\hyphenpenalty=50%
\binoppenalty=700%
\relpenalty=500%
}

Which I'm not if this is really good, maybe with \asciiensure it will?

Greek:

\def\BabelGreekRestoreFontEncoding{%
  \ifx\cf@encoding\BabelGreekPreviousFontEncoding
  \else
    \let\encodingdefault\BabelGreekPreviousFontEncoding
    \fontencoding{\encodingdefault}\selectfont
  \fi
}
\addto\extrasgreek{%
  \let\BabelGreekPreviousFontEncoding\cf@encoding
  \greekscript}

Which is facing the problem demonstrated above.

Udi-Fogiel commented 9 months ago

In any case, guessing what should be the encoding when exiting the language is hard, the best solution would be if each language will ensure the correct encoding for itself. If it will be the case there can be a uniform solution, that can be part of the interface provided by babel.sty

jbezos commented 9 months ago

Interestingly, if you don’t load explicitly english it works as expected 😯 (yes, the load-on-the-fly feature does switch the encoding). I’m wondering why no one has reported a bug, after so many years with this behavior. I have to analyze it (backwards compatibility is a problem).

Udi-Fogiel commented 9 months ago

Interestingly, if you don’t load explicitly english it works as expected 😯 (yes, the load-on-the-fly feature does switch the encoding).

do you mean using \babelprovide? with the following code

\documentclass{article}
\usepackage[T1,LGR]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[greek]{babel}
\babelprovide[import]{english}
\begin{document}
    Hello

    \selectlanguage{english}
    Hello
\end{document}

I get

This is with version 2023/08/29 v3.93

I’m wondering why no one has reported a bug, after so many years with this behavior.

It looks as if each language maintainer dealt with it differently. One of the main advantages of the new .ini files, is that they are maintained (mainly) by one person.

I have to analyze it (backwards compatibility is a problem).

as always...

jbezos commented 9 months ago

No, no, without any explicit declaration (see the manual, sec. “Mostly monolingual documents”):

\documentclass[greek]{article}

\usepackage[T2A,T1,LGR]{fontenc}

\usepackage{babel}

\begin{document}

Ελληνικά \foreignlanguage{bulgarian}{български} Ελληνικά

\selectlanguage{english}

English \foreignlanguage{greek}{Ελληνικά} English

\end{document}

gmilde commented 9 months ago

... I’m wondering why no one has reported a bug, after so many years with this behavior.

a) The problem is new: Up to 2023/03/04, \noextrasgreek always switched to the deprecated \latinencoding

changes in babel-greek-1.12: "Save/restore previous font encoding instead of switching to \latinencoding when leaving Greek."

b) It only happens if LGR is the document's main font encoding (loaded as last font encoding with fontenc):

This is a bad idea in any case and obviously wrong for documents with Greek as secondary language.
It is easy to avoid (load LGR before a standard text encoding)¹.
Many authors will not load font encodings at all (babel-greek loads LGR if required and takes care not to make it the document default (that is the very reason for its non-standard way of loading LGR).
Most documents with Greek as main language will use Xe/LuaTeX with Unicode fonts and no fontenc.

¹ Simple in hand-authored documents but requires special-casing in LyX.

gmilde commented 9 months ago

the load-on-the-fly feature does switch the encoding

Intresting. Which font encoding is used for English? How it it determined? How does it compare to \latinencoding and the font encoding used by \ensureascii? Could the Babel core provide a \DefaultStandardTextFontEncoding for use in language files and packages?

No, no, without any explicit declaration (see the manual, sec. “Mostly monolingual documents”):

Actually, while \babelprovide[import]{english} is optional, it does not change the behaviour with an up-to-date LaTeX installation (TeXLive2023 with latest updates) here. OTOH, with TeXLive2021 from Debian/oldstable, I get Greek letters in English text parts; both, with and without \babelprovide[import]{english} .

Udi-Fogiel commented 9 months ago

the load-on-the-fly feature does switch the encoding

Intresting. Which font encoding is used for English? How it it determined?

The font encodings for each language is declared in the corresponding .ini file, I'm not sure what are the exact rules for which encoding is used if there are several options.

Could the Babel core provide a \DefaultStandardTextFontEncoding for use in language files and packages?

If I understand correctly, this is what I tried to do with \asciiensure in this pull request.

jbezos commented 9 months ago

The rules are here: https://latex3.github.io/babel/news/whats-new-in-babel-3.84.html. Perhaps a new rule is necessary: if no encoding is found, fallback to OT1.

Something like:

\extrasgreek → save current encoding
\noextrasgree → restore previous encoding

doesn’t work if the main encoding is LGR because the first \selectlanguage{greek} comes when that encoding is in force, so it’s kept when we switch to another language.

I’m thinking of some solution – perhaps making semi-public the code for the on-the-fly loading, so that it can be used in ldf files easily.

In the meanwhile, well, we have a non standard encoding, loaded in a non standard way, so a non standard solution (with a deprecated macro) doesn’t seem too severe 🙂.

Note the ASCII encoding can be, for example, T2A (for Cyrillic), because the range 32-127 are the ASCII characters, and therefore it’s not the same as a Latin encoding. In fact, a document can have several Latin encodings (T1, T4, T5).

u-fischer commented 9 months ago

Something like:

\extrasgreek → save current encoding

\noextrasgree → restore previous encoding

You probably need an additional test:

\extrasgreek : if previous encoding not equal to LGR save it

gmilde commented 9 months ago

  \extrasgreek → save current encoding
   \noextrasgreek → restore previous encoding
doesn’t work if the main encoding is LGR because the first \selectlanguage{greek} comes when that encoding is in force, [...] ¹

Generally, a Babel language should save/restore the previous state. And it should trust defaults set by the document author.

However, with LGR, we have an exception: We know that is is almost always an error to use LGR after switching to a secondary language in a Greek document. Also, until recently \noextrasgreek contained an unconditional switch to \latinencoding (overwriting the user-set default), so backwards compatibilty is an additional issue.

I propose a test "AtBeginDocument":

if main-language == "greek" and \defaultencoding == "LGR":
   \warn{Changing the document's main font encoding to an
         ASCII-compatible fallback.}
   \defaultencoding = \ASCIICompatibleFallbackFontEncoding

For this, I need

a) to know how to check in an *.ldf file whether "greek" is the main language of the document, and b) what is the most sensible ASCII-compatible fallback font encoding.

For b), it would be good to re-use the logic hidden in the \ensureascii code. One possible way to achieve this is implemented in Udi's contributions. I would prefer a more self-explaining name and mentioning in the Babel documentation.

I don't need a font-encoding-changing command for greek.ldf, only the font encoding's name.

¹ Actually, it would be OK if the main language is belarusian, bulgarian, macedonian, russianb, serbianc, ukraineb, arabic, farsi, ibycus, mongolian (and maybe more) because these do still switch to \latinencoding on leave.

jbezos commented 9 months ago

Generally, a Babel language should save/restore the previous state. And it should trust defaults set by the document author.

Generally, but not in this case, before when the first (implicit) \selectlanguage{greek} is executed the state is LGR. And this is what is “restored”.

For b), it would be good to re-use the logic hidden in the \ensureascii code.

\ensureascii is not the way to go. It’s the logic currently implemented for the automatic switching with a load on-the-fly (actually also with \babelprovide, but this feature was devised mainly for the former). I’m studying how to decouple the encoding switcher so that it can be applied to an ldf language in case the main encoding is non standard.

gmilde commented 9 months ago

\ensureascii is not the way to go. It’s the logic currently implemented for the automatic switching with a load on-the-fly …

Babel-greek cannot use the font settings from ".ini" files, as it does not know which language is used after exiting "greek".

\latinencoding is a reasonalble fallback as it resolves to one of the pre-loaded font encodings, the legacy default OT1 or T1.

The repository version of babel-greek has now implemented a fallback solution for the case that the \defaultfontencoding is LGR and the language "greek" at the begin of the document. See https://codeberg.org/milde/greek-tex/src/branch/master/babel-greek

jbezos commented 9 months ago

Babel-greek cannot use the font settings from ".ini" files, as it does not know which language is used after exiting "greek".

This is indeed the problem. I’m working on extending the encoding selector to ldf languages. Hopefully it will be included in v3.96, but I have to be careful not to break anything.

\latinencoding is a reasonalble fallback as it resolves to one of the pre-loaded font encodings, the legacy default OT1 or T1.

👌

gmilde commented 9 months ago

This is indeed the problem. I’m working on extending the encoding selector to ldf languages.

The problem is even more basic: The "on-the-fly encoding selector" works for entering a language, not for leaving. It would only help if all other languages used it (which is a breaking change of behaviour).

We have several possible scenarios for the font encoding switches:

traditional: All languages can savely assume OT1, a standard text font encoding (T1, T2A, ...), or compatible (LY, QX). They switch if required and switch back to \latinencoding when leaving (russianb, belarusian, bulgarian, macedonian², ukrainean, serbianc, ..., greek until recently).

clean up after you: switch if required, switch back to previous font encoding when leaving (greek, hebrew).

check before use: switch to one of the supported font encodings (imported languages, languages on-the-fly).

Changing the scenario has consequences when document authors or classes select a special font encoding (e.g. QX for Polish or L7x for Lithuanian). It may render documents uncompilable or lead to strange font substitutions after a section in a "foreign" language.

For backwards compatibility reasons, I would recommend to stick with the traditional scenario. I am considering whether to revert "greek.ldf" to the traditional scenario, too.

If deemed generally useful, the check before use scenario may be offerd as an opt-in variant (but this would not help with "greek.ldf" to decide which font encoding to switch to when leavin "greek").

²macedonian.ldf has a line \let\latinencoding\cf@encoding, so it may belong to the clean up after you category.

jbezos commented 9 months ago

@gmilde Good analysis. The ‘traditional’ way is fine for me.

gmilde commented 8 months ago

The example below shows the problem with the "traditional" approach: inserting a text part in a language using the "traditional" approach may break documents requiring a font encoding different from T1.

\documentclass{article}
\usepackage{parskip}

\usepackage{lmodern}
% The L7x font encoding ensures correct hyphenation in Lithuanian
\usepackage[T2A,L7x]{fontenc}
\DeclareFontFamilySubstitution{T2A}{lmr}{cmr}
\DeclareFontFamilySubstitution{T2A}{lmtt}{cmtt}

\usepackage[russian,lithuanian]{babel}

\makeatletter % we want to see the value of some internal macros
\newcommand*{\cs}[1]{\texttt{\textbackslash#1}}

\begin{document}

The document's main language is \texttt{\bbl@main@language} (lietuvių kalba).
The initial \cs{encodingdefault} is \encodingdefault.

L7x is required for correct hyphenation of Lithuanian words with, e.g.,
\emph{ogonek} accent like lietuvių) under 8-bit
TeX.\footnote{https://hyphenation.org/index.html} Similar to T1, angle
brackets and the vertical line are printed as-is | <OK>.

\selectlanguage{russian}
Русский текст (started with \cs{selectlanguage}). The font encoding is
switched by \cs{extrasrussian}
(\cs{cf@encoding} \cf@encoding, \cs{encodingdefault} \encodingdefault).
On leaving, \cs{noextrasrussian} sets the font
encoding to the \cs{latinencoding}.

\selectlanguage{lithuanian}
Lithuanian with \cs{selectlanguage}. Font encoding are now
\cs{cf@encoding} \cf@encoding, \cs{encodingdefault} \encodingdefault.

Words with ogonek accent lead to a \LaTeX error:
język polski, lietuvių kalba.

Angle brackets come out as Spanish sentence marks and the vertical line as
em-dash | <sic>.

\end{document}

This is why I prefer the clean up after you scenario.

ivankokan commented 8 months ago

Digression.

For the completeness sake, it is not only about the hyphenation. It is about the font itself as well, check the following example and the differences.

\documentclass{article}

\usepackage{lmodern}
\usepackage[L7x,T1]{fontenc}
\usepackage[utf8]{inputenc}

\input{glyphtounicode}% EDIT
\pdfgentounicode=1% EDIT

\begin{document}

T1
\fontencoding{T1}\selectfont
ŲųĮįĄąĘę

L7x
\fontencoding{L7x}\selectfont
ŲųĮįĄąĘę

\end{document}

gmilde commented 8 months ago

\latintext was deprecated for a few reasons:

failed often.

It could switch the font even if unnecessary (most encodings include the ASCII range).

The script is not enough, because you also need know the language (eg, for hyphenation).

To fix the issue with non-Greek text parts in Greek documents, babel-greek 1.5 restores the previous default encoding with one exception: if the initial \encodingdefault is LGR and the main language "greek", switch to \latinencoding instead.

Unfortunately, the fix has to make use of the deprecated \latinencoding macro, therefore I would like to know more about the issue of "failed often".

The other two issues don't apply for theuse in \noextrasgreek:

We know that LGR is not suited for any language other than Greek, so a font switch is required.
The next language is not known so we just restore a reasonable default.

The \noextrasgreek fix could gain from the more advanced and robust determination of an ASCII-compatible font encoding in \ensureascii --- if this encoding were accessible via a replacement for \latinencoding.

gmilde commented 8 months ago

For the completeness sake, it is not only about the hyphenation. It is about the font itself as well, check the following example and the differences.

Yes, pre-composed characters have several advantages. One more: drag and drop from the PDF generated by your example:

T1 U
˛u
˛ ˛i
I ˛ĄąĘę
L7x ŲųĮįĄąĘę

This also affects text search in the PDF.

One example where babel-greek using the \ensureascii encoding instead of \latinencoding would mean an improvement is hence a document with Greek as main language and Lithuanian text parts and \usepackage[L7x,LGR]{fontenc} .

OTOH, the workaround to use \usepackage[LGR,L7x]{fontenc} instead to get this right is so easy that this may not be much of an issue.

Edited: I mixed the problematic font encoding order and the fix. Corrected.

ivankokan commented 8 months ago

@gmilde I added two more lines in my original comment that resolve the copy/paste issues.

EDIT: Actually, they do not. Good point!

jbezos commented 8 months ago

Problems related to how the T1 encoding renders some combining chars must be fixed elsewhere. The relevant point here is how encodings are selected by babel. I’ve added \asciiencoding, which stores the ASCII encoding as determined by babel, so that it can be easily retrieved and modified (with commit https://github.com/latex3/babel/commit/5c746a2354ed7ffa5c443441adf6536d55a4aef6). I think there is no real need for \asciiensure.

I'm closing this pull request because it has been merged (partially) by hand.

With existing ldf and for backwards compatibility, we have to stay with what is, based on ad hoc solutions. The only real solution is to have each language select the right encoding, and it‘s too late for a change of this magnitude. Just saving and restoring is not enough as this pseudo-document shows:

Load T1, T5, T2A
Select Russian (as the main language)
Select Vietnamese || Select English

There is no encoding to switch back, and the new encoding isn’t known until either Vietnamese or English is selected.

ivankokan commented 8 months ago

@gmilde I just want to verify one thing (using Ų and LM, for example)...

One more: drag and drop from the PDF generated by your example:
T1 U
˛u
˛ ˛i
I ˛ĄąĘę
L7x ŲųĮįĄąĘę
This also affects text search in the PDF.

For L7x we have the following "chain":

\DeclareUnicodeCharacter{0172}{\k U}
=>
\DeclareTextCommand{\k}{L7x}[1]{\oalign{\null#1\crcr\hidewidth\char12}}
% the latter is then "overridden" by (which uses the pre-composed glyph for Ų)
\DeclareTextComposite{\k}{L7x}{U}{216}% "D8
=>
% and has a name
/enclml7x[... /Uogonek ...] % at position 216

On the other hand, for T1 we have only

\DeclareUnicodeCharacter{0172}{\k U}
=>
\DeclareTextCommand{\k}{T1}[1]{\hmode@bgroup\ooalign{\null#1\crcr\hidewidth\char12}\egroup}
% and there is no \DeclareTextComposite{\k} for U defined so the latter is used to mimic the Ų
% additionally, /enclmec[...] does not contain anything about Ų

which is insufficient.

On top of everything, glyphtounicode does not actually bring anything with

\pdfglyphtounicode{Uogonek}{0172}

as Uogonek (coming from /enclml7x[... /Uogonek ...]) is a recognized glyph name (https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt).

Am I right?

gmilde commented 5 months ago

I don't know the details of glyphtounicode. For the font issues, your analysis is correct: L7x has a slot for the Uogonek (and other letters with ogonek) as a separate character while T1 uses a composition of two glyphs. This is why selecting the correct font encoding matters.

For languages that use accented letters, I would recommend the ".ldf" file to switch to an encoding with full supporting (i.e. pre-composed letters in the font table) in `\extras` -- if this font encoding is defined (i.e. loaded with "fontenc") and switching back to the previous font encoding on exit. This way an document author can easily configure whether to use a font encoding with full character support or a font encoding with good font family support.

Example

\documentclass{article}
\usepackage[L7x,T1]{fontenc}  % I want L7x for Polish text parts
\usepackage{lmodern}
\usepackage[polish,english]{babel}
...

vs.

\documentclass{article}
\usepackage{andika}
\usepackage[T1]{fontenc}  % I want T1 for Polish text parts
\usepackage[polish,english]{babel}
...

jbezos commented 5 months ago

Although Polish is covered by T1, I got your point. However, this syntax is currently valid with another behavior, and changing it would break many documents. May be a package option or a macro to easy things, but I’d like to avoid adding many more options and macros for very specific situations.

I’m still ruminating about this whole fontenc thing, but don’t expect too many changes, except for some warnings or infos (like your suggestion to warn about encodings required by a language, which would be certainly useful).

latex3 / babel

robustifying \ensureascii and adding \asciiensure #263