Closed Udi-Fogiel closed 8 months ago
See this conversation for a use of \asciiensure
(specifically the last two comments).
\ensureascii
is unprotected on purpose. It’s either ignored, so there is no real reason to protect it, or it’s defined with a couple of protected macros. The latter definition is activated only when fontenc
is loaded with a non-LICR/non-ASCII encoding, most notably LGR
(the only one in real use, I think).
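In other words, a rough sketch of the two states (not babel’s actual code, which is more involved; the T1 choice below is only illustrative):

```latex
% State 1: only ASCII-compatible (LICR) encodings are loaded,
% so \ensureascii can simply typeset its argument:
\newcommand\ensureascii[1]{#1}

% State 2: fontenc was loaded with a non-ASCII encoding such as
% LGR, so the argument is wrapped in an encoding switch built
% from protected macros (the target encoding is illustrative):
\DeclareRobustCommand\ensureascii[1]{%
  {\fontencoding{T1}\selectfont #1}}
```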
When \latintext
and the like were introduced very long ago, babel
was still essentially a package to select the language for a monolingual document. Pairs like \cyrillictext
/ \latintext
made some sense, but not now.
The original explanation for this set of macros was:
When text is being typeset in an encoding other than ‘latin’ (OT1 or T1), it would be nice to still have Roman numerals come out in the Latin encoding.
Which clearly shows how accessory and linked to encodings like LGR
and OT2
it was.
\latintext
was deprecated for a few reasons:
I don’t fully understand the last two comments in the linked discussion. Can you provide an example?
The main issue Günter, the maintainer of babel-greek, and I are facing is how to set the correct font encoding when switching from Hebrew/Greek to another language. The example file I posted at the start of the linked ticket produces the following code (process with pdfTeX):
\documentclass[english,greek]{article}
\usepackage[T1,LGR]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage{babel}
\begin{document}
Hello
\selectlanguage{english}
Hello
\end{document}
As you can see, the font encoding did not switch to an ASCII-compatible one, so the output is wrong. There are several ways to solve it. I suggested that Günter use \latintext
, as I've noticed arabi
is using it, but he said that \latintext
is considered deprecated, so I created the pull request.
A second option would be to require users to load encodings such as LGR
only as a secondary encoding, and to declare another, ASCII-compatible encoding as the main one (which is essentially what babel-hebrew is doing, but I don't like it).
As a third option, we can require all the .ldf
maintainers to make sure that the correct encoding is used when switching to their language, like all the non-Latin languages are doing. I think this is the best solution, but currently none of the Latin languages do anything related to it, and I don't know whether all the languages are still actively maintained.
And lastly, we can drop all font-encoding support from .ldf
files and put the responsibility on users to add the encodings they would like via \extras
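For that last option, the user-level setup could look roughly like this (a sketch; \addto and the \extrasgreek/\noextrasgreek hooks are babel's standard interface, but the exact switches shown are my assumption):

```latex
\usepackage[T1,LGR]{fontenc}
\usepackage[greek,english]{babel}
% The user, not greek.ldf, declares the encoding switches:
\addto\extrasgreek{\fontencoding{LGR}\selectfont}
\addto\noextrasgreek{\fontencoding{T1}\selectfont}
```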
Just in case it is useful, here is how some of the non-Latin languages are currently dealing with this problem:
Hebrew:
\if@rl%
\let\encodingdefault=\lr@encodingdefault%
\fi%
\fontencoding{\encodingdefault}%
\selectfont%
\@rlfalse
which is not really a good solution (I'm not sure why it is part of rlbabel.def
and not hebrew.ldf
, and why it involves \if@rl
).
Arabic:
\addto\noextrasarabic{%
\@rlfalse
\@arabicfalse
\latintext\normalfont %enough ??
% Restore the lplain.tex penalties??
\hyphenpenalty=50%
\binoppenalty=700%
\relpenalty=500%
}
Which I'm not sure is really good; maybe with \asciiensure
it would be?
Greek:
\def\BabelGreekRestoreFontEncoding{%
\ifx\cf@encoding\BabelGreekPreviousFontEncoding
\else
\let\encodingdefault\BabelGreekPreviousFontEncoding
\fontencoding{\encodingdefault}\selectfont
\fi
}
\addto\extrasgreek{%
\let\BabelGreekPreviousFontEncoding\cf@encoding
\greekscript}
Which is facing the problem demonstrated above.
In any case, guessing what the encoding should be when exiting the language is hard; the best solution would be for each language to ensure the correct encoding for itself. If that were the case, there could be a uniform solution that could be part of the interface provided by babel.sty
Interestingly, if you don’t load explicitly english
it works as expected 😯 (yes, the load-on-the-fly feature does switch the encoding). I’m wondering why no one has reported a bug, after so many years with this behavior. I have to analyze it (backwards compatibility is a problem).
Interestingly, if you don’t load explicitly english it works as expected 😯 (yes, the load-on-the-fly feature does switch the encoding).
do you mean using \babelprovide
? with the following code
\documentclass{article}
\usepackage[T1,LGR]{fontenc}
\usepackage[utf8]{inputenc}
\usepackage[greek]{babel}
\babelprovide[import]{english}
\begin{document}
Hello
\selectlanguage{english}
Hello
\end{document}
I get
This is with version 2023/08/29 v3.93
I’m wondering why no one has reported a bug, after so many years with this behavior.
It looks as if each language maintainer dealt with it differently. One of the main advantages of the new .ini
files is that they are maintained (mainly) by one person.
I have to analyze it (backwards compatibility is a problem).
as always...
No, no, without any explicit declaration (see the manual, sec. “Mostly monolingual documents”):
\documentclass[greek]{article}
\usepackage[T2A,T1,LGR]{fontenc}
\usepackage{babel}
\begin{document}
Ελληνικά \foreignlanguage{bulgarian}{български} Ελληνικά
\selectlanguage{english}
English \foreignlanguage{greek}{Ελληνικά} English
\end{document}
... I’m wondering why no one has reported a bug, after so many years with this behavior.
a) The problem is new: up to 2023/03/04, \noextrasgreek
always switched to the deprecated \latinencoding
.
Changes in babel-greek-1.12: "Save/restore previous font encoding instead of switching to
\latinencoding
when leaving Greek."
b) It only happens if LGR is the document's main font encoding (loaded as last font encoding with fontenc):
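That is, with the encoding order from the example at the top of this thread the problem appears, while reversing the order avoids it (the last encoding passed to fontenc becomes the document default):

```latex
\usepackage[T1,LGR]{fontenc} % LGR is the default encoding: problem
\usepackage[LGR,T1]{fontenc} % T1 is the default encoding: works
```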
¹ Simple in hand-authored documents but requires special-casing in LyX.
the load-on-the-fly feature does switch the encoding
Interesting.
Which font encoding is used for English?
How is it determined? How does it compare to \latinencoding
and the font encoding used by \ensureascii
?
Could the Babel core provide a \DefaultStandardTextFontEncoding
for use in language files and packages?
No, no, without any explicit declaration (see the manual, sec. “Mostly monolingual documents”):
Actually, while \babelprovide[import]{english}
is optional, it does not change the behaviour with an up-to-date LaTeX installation (TeXLive2023 with latest updates) here.
OTOH, with TeXLive2021 from Debian/oldstable, I get Greek letters in English text parts, both with and without \babelprovide[import]{english}
.
the load-on-the-fly feature does switch the encoding
Interesting. Which font encoding is used for English? How is it determined?
The font encodings for each language are declared in the corresponding .ini file; I'm not sure what the exact rules are for which encoding is used if there are several options.
Could the Babel core provide a
\DefaultStandardTextFontEncoding
for use in language files and packages?
If I understand correctly, this is what I tried to do with \asciiensure
in this pull request.
The rules are here: https://latex3.github.io/babel/news/whats-new-in-babel-3.84.html. Perhaps a new rule is necessary: if no encoding is found, fallback to OT1
.
Something like:
\extrasgreek → save current encoding
\noextrasgreek → restore previous encoding
doesn’t work if the main encoding is LGR
because the first \selectlanguage{greek}
comes when that encoding is in force, so it’s kept when we switch to another language.
I’m thinking of some solution – perhaps making semi-public the code for the on-the-fly loading, so that it can be used in ldf
files easily.
In the meanwhile, well, we have a non standard encoding, loaded in a non standard way, so a non standard solution (with a deprecated macro) doesn’t seem too severe 🙂.
Note that the ASCII encoding can be, for example, T2A (for Cyrillic), because slots 32-127 are the ASCII characters, and therefore it’s not the same as a Latin encoding. In fact, a document can have several Latin encodings (T1
, T4
, T5
).
Something like:
\extrasgreek → save current encoding
\noextrasgreek → restore previous encoding
You probably need an additional test:
\extrasgreek → if the previous encoding is not equal to LGR, save it
\noextrasgreek → restore previous encoding
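In package code (where @ is a letter), that additional test might look like the following sketch, reusing the macro names from the Greek snippet above (\bbl@tmp is just a hypothetical scratch macro for the comparison):

```latex
\addto\extrasgreek{%
  \def\bbl@tmp{LGR}%
  \ifx\cf@encoding\bbl@tmp\else % only save non-LGR encodings
    \let\BabelGreekPreviousFontEncoding\cf@encoding
  \fi
  \greekscript}
```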
doesn’t work if the main encoding is LGR because the first \selectlanguage{greek} comes when that encoding is in force, [...] ¹
Generally, a Babel language should save/restore the previous state. And it should trust defaults set by the document author.
However, with LGR, we have an exception: we know that it is almost always an error to use LGR after switching to a secondary language in a Greek document.
Also, until recently \noextrasgreek
contained an unconditional switch to \latinencoding
(overwriting the user-set default), so backwards compatibility is an additional issue.
I propose a test "AtBeginDocument":
if main-language == "greek" and \defaultencoding == "LGR":
\warn{Changing the document's main font encoding to an
ASCII-compatible fallback.}
\defaultencoding = \ASCIICompatibleFallbackFontEncoding
For this, I need
a) to know how to check in an *.ldf file whether "greek" is the main language of the document, and b) what is the most sensible ASCII-compatible fallback font encoding.
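For (a), one possibility is babel's internal \bbl@main@language (it appears later in this thread; being internal, it is not a guaranteed interface). A sketch of the proposed test, with T1 as an assumed fallback for (b):

```latex
\AtBeginDocument{%
  \def\bbl@tmp{greek}%
  \ifx\bbl@main@language\bbl@tmp
    \def\bbl@tmp{LGR}%
    \ifx\encodingdefault\bbl@tmp
      \PackageWarning{babel-greek}{Changing the document's main
        font encoding to an ASCII-compatible fallback}%
      % the best fallback choice is exactly question (b):
      \renewcommand\encodingdefault{T1}%
    \fi
  \fi}
```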
For b), it would be good to re-use the logic hidden in the \ensureascii
code.
One possible way to achieve this is implemented in Udi's contributions.
I would prefer a more self-explanatory name and a mention in the Babel documentation.
I don't need a font-encoding-changing command for greek.ldf, only the font encoding's name.
¹ Actually, it would be OK if the main language is belarusian, bulgarian,
macedonian, russianb, serbianc, ukraineb, arabic, farsi, ibycus, mongolian
(and maybe more) because these do still switch to \latinencoding
on leave.
Generally, a Babel language should save/restore the previous state. And it should trust defaults set by the document author.
Generally, but not in this case: when the first (implicit) \selectlanguage{greek}
is executed, the state is LGR
, and this is what is “restored”.
For b), it would be good to re-use the logic hidden in the
\ensureascii
code.
\ensureascii
is not the way to go. It’s the logic currently implemented for the automatic switching with a load on-the-fly (actually also with \babelprovide
, but this feature was devised mainly for the former). I’m studying how to decouple the encoding switcher so that it can be applied to an ldf
language in case the main encoding is non standard.
\ensureascii is not the way to go. It’s the logic currently implemented for the automatic switching with a load on-the-fly …
Babel-greek cannot use the font settings from ".ini" files, as it does not know which language is used after exiting "greek".
\latinencoding
is a reasonable fallback, as it resolves to one of the pre-loaded font encodings, the legacy default OT1 or T1.
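The resolution logic is roughly the following sketch (package-internal code, so @ must be a letter; this is my reconstruction, not babel's verbatim code; \T@T1 is defined by fontenc when T1 is loaded):

```latex
\def\latinencoding{OT1}% legacy default
\ifx\T@T1\@undefined\else
  \def\latinencoding{T1}% T1 was loaded with fontenc
\fi
```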
The repository version of babel-greek has now implemented a fallback solution for the case that the \defaultfontencoding
is LGR and the language is "greek" at the beginning of the document.
See https://codeberg.org/milde/greek-tex/src/branch/master/babel-greek
Babel-greek cannot use the font settings from ".ini" files, as it does not know which language is used after exiting "greek".
This is indeed the problem. I’m working on extending the encoding selector to ldf
languages. Hopefully it will be included in v3.96, but I have to be careful not to break anything.
\latinencoding
is a reasonalble fallback as it resolves to one of the pre-loaded font encodings, the legacy default OT1 or T1.
👌
This is indeed the problem. I’m working on extending the encoding selector to ldf languages.
The problem is even more basic: The "on-the-fly encoding selector" works for entering a language, not for leaving. It would only help if all other languages used it (which is a breaking change of behaviour).
We have several possible scenarios for the font encoding switches:
traditional: All languages can safely assume OT1, a standard text font encoding (T1, T2A, ...), or a compatible one (LY1, QX).
They switch if required and switch back to \latinencoding
when leaving
(russianb, belarusian, bulgarian, macedonian², ukrainean, serbianc, ..., greek until recently).
clean up after you: switch if required, switch back to previous font encoding when leaving (greek, hebrew).
check before use: switch to one of the supported font encodings (imported languages, languages on-the-fly).
Changing the scenario has consequences when document authors or classes select a special font encoding (e.g. QX for Polish or L7x for Lithuanian). It may render documents uncompilable or lead to strange font substitutions after a section in a "foreign" language.
For backwards compatibility reasons, I would recommend to stick with the traditional scenario. I am considering whether to revert "greek.ldf" to the traditional scenario, too.
If deemed generally useful, the check before use scenario may be offered as an opt-in variant (but this would not help "greek.ldf" decide which font encoding to switch to when leaving "greek").
²macedonian.ldf has a line \let\latinencoding\cf@encoding
, so it may belong to the clean up after you category.
@gmilde Good analysis. The ‘traditional’ way is fine for me.
The example below shows the problem with the "traditional" approach: inserting a text part in a language using the "traditional" approach may break documents requiring a font encoding different from T1.
\documentclass{article}
\usepackage{parskip}
\usepackage{lmodern}
% The L7x font encoding ensures correct hyphenation in Lithuanian
\usepackage[T2A,L7x]{fontenc}
\DeclareFontFamilySubstitution{T2A}{lmr}{cmr}
\DeclareFontFamilySubstitution{T2A}{lmtt}{cmtt}
\usepackage[russian,lithuanian]{babel}
\makeatletter % we want to see the value of some internal macros
\newcommand*{\cs}[1]{\texttt{\textbackslash#1}}
\begin{document}
The document's main language is \texttt{\bbl@main@language} (lietuvių kalba).
The initial \cs{encodingdefault} is \encodingdefault.
L7x is required for correct hyphenation of Lithuanian words (with, e.g.,
\emph{ogonek} accent like lietuvių) under 8-bit
TeX.\footnote{https://hyphenation.org/index.html} Similar to T1, angle
brackets and the vertical line are printed as-is | <OK>.
\selectlanguage{russian}
Русский текст (started with \cs{selectlanguage}). The font encoding is
switched by \cs{extrasrussian}
(\cs{cf@encoding} \cf@encoding, \cs{encodingdefault} \encodingdefault).
On leaving, \cs{noextrasrussian} sets the font
encoding to the \cs{latinencoding}.
\selectlanguage{lithuanian}
Lithuanian with \cs{selectlanguage}. The font encodings are now
\cs{cf@encoding} \cf@encoding, \cs{encodingdefault} \encodingdefault.
Words with ogonek accent lead to a \LaTeX{} error:
język polski, lietuvių kalba.
Angle brackets come out as Spanish sentence marks and the vertical line as
em-dash | <sic>.
\end{document}
This is why I prefer the clean up after you scenario.
Digression.
For completeness' sake, it is not only about the hyphenation. It is about the font itself as well; check the following example and the differences.
\documentclass{article}
\usepackage{lmodern}
\usepackage[L7x,T1]{fontenc}
\usepackage[utf8]{inputenc}
\input{glyphtounicode}% EDIT
\pdfgentounicode=1% EDIT
\begin{document}
T1
\fontencoding{T1}\selectfont
ŲųĮįĄąĘę
L7x
\fontencoding{L7x}\selectfont
ŲųĮįĄąĘę
\end{document}
\latintext was deprecated for a few reasons:
- It failed often.
- It could switch the font even if unnecessary (most encodings include the ASCII range).
- The script is not enough, because you also need to know the language (e.g., for hyphenation).
To fix the issue with non-Greek text parts in Greek documents,
babel-greek 1.5 restores the previous default encoding with one exception: if the initial \encodingdefault
is LGR and the main language is "greek", it switches to \latinencoding
instead.
Unfortunately, the fix has to make use of the deprecated \latinencoding
macro, therefore I would like to know more about the issue of "failed often".
The other two issues don't apply for the use in \noextrasgreek
:
The \noextrasgreek
fix could gain from the more advanced and robust determination of an ASCII-compatible font encoding in \ensureascii
--- if this encoding were accessible via a replacement for \latinencoding
.
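That is, with a hypothetical replacement for \latinencoding exposing babel's choice (here called \asciiencoding for illustration), the fix could reduce to something like:

```latex
% Hypothetical: \asciiencoding holds babel's ASCII-compatible
% encoding; \noextrasgreek could then restore it directly.
\addto\noextrasgreek{\fontencoding{\asciiencoding}\selectfont}
```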
For the completeness sake, it is not only about the hyphenation. It is about the font itself as well, check the following example and the differences.
Yes, pre-composed characters have several advantages. One more: drag and drop from the PDF generated by your example:
T1 U
˛u
˛ ˛i
I ˛ĄąĘę
L7x ŲųĮįĄąĘę
This also affects text search in the PDF.
Hence, one example where babel-greek using the \ensureascii
encoding instead of \latinencoding
would mean an improvement is a document with Greek as main language, Lithuanian text parts, and
\usepackage[L7x,LGR]{fontenc}
.
OTOH, the workaround to use \usepackage[LGR,L7x]{fontenc}
instead to get this right is so easy that this may not be much of an issue.
Edited: I mixed the problematic font encoding order and the fix. Corrected.
@gmilde I added two more lines in my original comment that resolve the copy/paste issues.
EDIT: Actually, they do not. Good point!
Problems related to how the T1
encoding renders some combining chars must be fixed elsewhere. The relevant point here is how encodings are selected by babel
. I’ve added \asciiencoding
, which stores the ASCII encoding as determined by babel
, so that it can be easily retrieved and modified (with commit https://github.com/latex3/babel/commit/5c746a2354ed7ffa5c443441adf6536d55a4aef6). I think there is no real need for \asciiensure
.
I'm closing this pull request because it has been merged (partially) by hand.
With existing ldf
files, and for backwards compatibility, we have to stay with what is, based on ad hoc solutions. The only real solution is to have each language select the right encoding, and it's too late for a change of this magnitude. Just saving and restoring is not enough, as this pseudo-document shows:
Load T1, T5, T2A
Select Russian (as the main language)
Select Vietnamese || Select English
There is no encoding to switch back, and the new encoding isn’t known until either Vietnamese or English is selected.
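Written out, the pseudo-document might look like this (a sketch):

```latex
\documentclass{article}
\usepackage[T1,T5,T2A]{fontenc}% T2A (last) becomes the default
\usepackage[vietnamese,english,russian]{babel}% main: russian
\begin{document}
Русский текст
% On leaving Russian there is no saved encoding to restore,
% and the right target (T5 or T1) is only known once
% Vietnamese or English is actually selected:
\selectlanguage{english}% wants T1
\end{document}
```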
@gmilde I just want to verify one thing (using Ų and LM, for example)...
One more: drag and drop from the PDF generated by your example:
T1 U ˛u ˛ ˛i I ˛ĄąĘę L7x ŲųĮįĄąĘę
This also affects text search in the PDF.
For L7x we have the following "chain":
\DeclareUnicodeCharacter{0172}{\k U}
=>
\DeclareTextCommand{\k}{L7x}[1]{\oalign{\null#1\crcr\hidewidth\char12}}
% the latter is then "overridden" by (which uses the pre-composed glyph for Ų)
\DeclareTextComposite{\k}{L7x}{U}{216}% "D8
=>
% and has a name
/enclml7x[... /Uogonek ...] % at position 216
On the other hand, for T1 we have only
\DeclareUnicodeCharacter{0172}{\k U}
=>
\DeclareTextCommand{\k}{T1}[1]{\hmode@bgroup\ooalign{\null#1\crcr\hidewidth\char12}\egroup}
% and there is no \DeclareTextComposite{\k} for U defined so the latter is used to mimic the Ų
% additionally, /enclmec[...] does not contain anything about Ų
which is insufficient.
On top of everything, glyphtounicode
does not actually bring anything with
\pdfglyphtounicode{Uogonek}{0172}
as Uogonek
(coming from /enclml7x[... /Uogonek ...]
) is a recognized glyph name (https://github.com/adobe-type-tools/agl-aglfn/blob/master/glyphlist.txt).
Am I right?
I don't know the details of glyphtounicode
.
For the font issues, your analysis is correct:
L7x has a slot for the Uogonek (and other letters with ogonek) as a separate character while
T1 uses a composition of two glyphs.
This is why selecting the correct font encoding matters.
For languages that use accented letters, I would recommend that the ".ldf" file switch to an encoding with full support (i.e. pre-composed letters in the font table) in `\extras` -- if this font encoding is defined (i.e. loaded with "fontenc") -- and switch back to the previous font encoding on exit. This way a document author can easily configure whether to use a font encoding with full character support or a font encoding with good font family support.
Example
\documentclass{article}
\usepackage[L7x,T1]{fontenc} % I want L7x for Polish text parts
\usepackage{lmodern}
\usepackage[polish,english]{babel}
...
vs.
\documentclass{article}
\usepackage{andika}
\usepackage[T1]{fontenc} % I want T1 for Polish text parts
\usepackage[polish,english]{babel}
...
Although Polish is covered by T1
, I got your point. However, this syntax is currently valid with another behavior, and changing it would break many documents. May be a package option or a macro to easy things, but I’d like to avoid adding many more options and macros for very specific situations.
I’m still ruminating about this whole fontenc
thing, but don’t expect too many changes, except for some warnings or infos (like your suggestion to warn about encodings required by a language, which would be certainly useful).
I thought it would be useful to have
\asciiensure
(similar to \latintext
). I also think \ensureascii
should be robust, as it is not expandable (maybe some variant of \protected
would be preferable? I'm not really sure what the best way to protect macros in LaTeX is these days, but \DeclareRobustCommand
is how \latintext
is defined...) BTW, why is
\latintext
considered deprecated? Just because it does not consider all font encodings, or is there anything more fundamental?