8-bit font encodings in *.ini files

gmilde commented 8 months ago

There are several issues with the font encoding switches for languages "imported" from *.ini files or used "on-the-fly":

Some languages currently listed as supporting OT1 use characters or accents that are not supported by OT1. Examples are Polish and Lithuanian; there may be others where OT1 should be stripped from the "encodings" list.
Correct hyphenation under 8-bit TeX is only supported for the font encoding used in the pattern file (or compatible ones).¹ This font encoding should be preferred².

¹ There is a list of "8-bit hyphenation encodings" in the "Languages" table in https://hyphenation.org/index.html. It may be a bit outdated and incomplete but may serve as a start. In the table, EC stands for T1 and ASCII for "OT1 or ASCII compatible" (languages that don't use accented characters). It may be helpful to get in touch with the maintainers of https://www.ctan.org/pkg/hyph-utf8.

² Whether a font encoding switch is actually better depends on the specific case: as some of these encodings have limited font support, a secondary choice may be OK for single words or short quotes. Marking up a text part in a different language may just serve external purposes (spellcheck) but could also imply the author's intention to get correct hyphenation and correct representation of "exotic" characters. Loading a font encoding with "fontenc" may serve as an indicator that it should be used for languages that have it as "first choice".

The following example shows the problems: Two errors due to the ogonek accent, wrong characters for the "comma below" accent, suboptimal font encoding choices with respect to hyphenation and use of pre-composed characters.

\documentclass[english]{article}
\usepackage{parskip}
\usepackage[T1,T2A,L7x,QX,OT1]{fontenc}
\usepackage{babel}
\makeatletter
\begin{document}

Default language,  
current font encoding \cf@encoding,
default font encoding \encodingdefault.

\foreignlanguage{bulgarian}{Български,
current font encoding \cf@encoding.}

\foreignlanguage{polish}{Język polski,
current font encoding \cf@encoding.
The \emph{ogonek} accent fails with OT1 (QX, L7x, T1, T2A work).
Correct hyphenation requires QX but font support is limited.}

\foreignlanguage{lithuanian}{Lietuvių kalba,
current font encoding \cf@encoding.
The \emph{ogonek} accent fails with OT1 (QX, L7x, T1, T2A work).
Correct hyphenation requires L7x but font support is limited.}

\foreignlanguage{latvian}{Latviešu valoda,
current font encoding \cf@encoding.
Correct rendering of \emph{comma below} accent (Ģ Ķ Ļ Ņ ģ ķ ļ ņ)
requires L7x or T1,
correct hyphenation requires L7x but font support is limited.}

\end{document}

ivankokan commented 8 months ago

> Some languages currently listed as supporting OT1 use characters or accents that are not supported by OT1. Examples are Polish and Lithuanian; there may be others where OT1 should be stripped from the "encodings" list.

@gmilde @jbezos I guess that Croatian is one of them. However, there are \dj and \DJ implemented in the old days to address this issue, so eventually I think that OT1 can be left in the encodings list:

encodings = T1 OT1 LY1

On the other hand, I am not familiar with LY1. Please clarify.

jbezos commented 8 months ago

@gmilde Selecting OT1 as the main encoding, and therefore the preferred one, effectively makes many other Latin encodings a no-op (T4 and T5 are exceptions), so I wonder if it makes sense, considering T1 (and IIRC, LY1) is a superset. But clearly OT1 cannot be included in the list for Polish and Lithuanian, because there is no glyph for the ogonek at all (there are other languages with ogonek, like Icelandic and Navajo).

gmilde commented 8 months ago

> Selecting OT1 as the main encoding, and therefore the preferred one, effectively makes many other Latin encodings a no-op (T4 and T5 are exceptions), so I wonder if it makes sense, considering T1 (and IIRC, LY1) is a superset.

IMO, a font encoding switch suggests itself, if the current encoding only provides a subset of the required encoding.

For every language, we may distinguish two sets of encodings:¹

canonical (hyphenation works, drag-and-drop works, characters are correctly represented in print) and
substitute (no compilation errors but some characters are composites leading to omissions in hyphenation and possibly errors with drag-and-drop/search from the PDF).

We should consider a suitable way to represent these sets in the *.ini files.

Both sets may contain more than one font encoding with variations outside the letters actually used in the respective language.

We may use some term or external list for supersets, e.g.

"canonical" OT1 implies that all standard text encodings and also non-standard but ASCII-compatible ones are "canonical" too (https://hyphenation.org uses the qualifier "ASCII").
"canonical" T1 implies that all standard text encodings (as well as LY1 and probably some more) should work as "substitute" font encodings.

¹There is a grey zone when composite representations have wrong accent glyphs (like Romanian/Latvian characters with comma below in OT1) or misplaced accents (like the comma below in T1).
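
As a rough illustration of how the two sets might be recorded, the sketch below parses a made-up *.ini fragment with Python's configparser. The key names "encodings.canonical" and "encodings.substitute" and the section name "identification" are assumptions made for this example only; nothing in this thread defines them. The sample values for Polish are taken from the comments in the example file above.

```python
import configparser

# Hypothetical fragment: "encodings.canonical" and "encodings.substitute"
# are invented key names for illustration; the section name is assumed too.
SAMPLE = """
[identification]
encodings.canonical = QX
encodings.substitute = T1 L7x T2A
"""


def read_encoding_sets(text, section="identification"):
    """Return (canonical, substitute) as two lists of encoding names."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    sec = parser[section]
    # A missing key yields an empty list rather than an error.
    canonical = sec.get("encodings.canonical", "").split()
    substitute = sec.get("encodings.substitute", "").split()
    return canonical, substitute


canonical, substitute = read_encoding_sets(SAMPLE)
print(canonical, substitute)
```

Keeping the order within each list would allow it to double as a preference order.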

> But clearly OT1 cannot be included in the list for Polish and Lithuanian, because there is no glyph for the ogonek at all (there are other languages with ogonek, like Icelandic and Navajo).

Currently, 173 babel/locale/*.ini files contain OT1 in the "encodings" list. All of them should be tested for OT1 compatibility. (There is no ogonek in Icelandic but thorn and eth fail with OT1, hyphenation requires T1.)
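
A list of the files to test could be collected with a small script along these lines. It is only a sketch: it assumes the list appears on a single line of the form "encodings = T1 OT1 LY1" (as in the Croatian example quoted earlier), and the function names are made up for this illustration.

```python
import re
from pathlib import Path

# Matches a line such as "encodings = T1 OT1 LY1" and captures the value.
ENCODINGS_LINE = re.compile(r"^[ \t]*encodings[ \t]*=[ \t]*(.*)$", re.MULTILINE)


def lists_encoding(ini_text, font_encoding="OT1"):
    """Return True if an 'encodings = ...' line names font_encoding."""
    for match in ENCODINGS_LINE.finditer(ini_text):
        # Compare whole tokens so that e.g. "OT1" does not match "OT1x".
        if font_encoding in match.group(1).split():
            return True
    return False


def files_listing(locale_dir, font_encoding="OT1"):
    """Return the sorted names of *.ini files below locale_dir that match."""
    hits = []
    for path in Path(locale_dir).rglob("*.ini"):
        if lists_encoding(path.read_text(encoding="utf-8"), font_encoding):
            hits.append(path.name)
    return sorted(hits)


print(lists_encoding("encodings = T1 OT1 LY1"))  # True
```

Calling files_listing("babel/locale") in a checkout of the repository would then give the candidates for an OT1 compatibility test.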

@ivankokan All non-ASCII characters used in Croatian (č ć ǆ đ ǉ ǌ š ž Č Ć Ǆ ǅ Đ Ǉ ǈ Ǌ ǋ Š Ž) work with OT1, while hyphenation requires T1. (The double letters are automatically decomposed here, but the legacy characters work fine in my example file.) LY1 is an alternative to the T1 encoding developed by Y&Y and used in their commercial TeX implementation; "encguide.pdf" has an encoding table. For many Western European languages it is a "canonical" encoding. For Croatian, it can be used as a "substitute" encoding (as can OT1, T2A and others).

IMO, Babel's on-the-fly/imported languages should

1. switch to a canonical font encoding if one is declared in the document.
2. Otherwise, emit a warning (suggesting to declare one of the canonical encodings) and select a known substitute font encoding.
3. If no substitute font encoding is declared, emit a warning (suggesting to declare one of the canonical or at least one of the substitute encodings).
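
The fallback order above can be modelled in a few lines of Python. This is an illustration of the proposal, not babel code: the function name and its arguments are hypothetical, and keeping the current encoding whenever it already qualifies is an extra assumption of this sketch.

```python
import warnings


def choose_font_encoding(declared, canonical, substitute, current):
    """Return the font encoding to select for a language (model only)."""

    def pick(candidates):
        # Assumption: stay with the current encoding if it qualifies.
        if current in candidates:
            return current
        for enc in candidates:
            if enc in declared:
                return enc
        return None

    # Step 1: prefer a declared canonical encoding.
    chosen = pick(canonical)
    if chosen is not None:
        return chosen

    # Step 2: fall back to a declared substitute encoding, with a warning.
    chosen = pick(substitute)
    if chosen is not None:
        warnings.warn("consider declaring one of: " + " ".join(canonical))
        return chosen

    # Step 3: nothing suitable is declared; warn and keep the current one.
    warnings.warn(
        "consider declaring one of: " + " ".join(canonical)
        + " or at least one of: " + " ".join(substitute)
    )
    return current


# Polish with the fontenc list from the example file (OT1 is current).
print(choose_font_encoding(["T1", "T2A", "L7x", "QX", "OT1"],
                           ["QX"], ["T1", "L7x", "T2A"], "OT1"))  # QX
```

With QX removed from the declared list, the same call returns T1 and emits the first warning; with only OT1 declared, it returns OT1 and emits the second.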

jbezos commented 8 months ago

A choice for the default behavior must be made – prioritize either font or hyphenation. There are ~50 fd files for QX vs. ~800 for T1, and the manual for babel-polish doesn’t even mention the former. The current rules prioritize fonts because a sudden change is usually meaningful, at the cost of some missing hyphens. The real limitation is the selected encoding must render all characters (thorn, eth, ogonek, schwa, eng, etc.). If necessary, the preferred encoding for a language can be set by users, if the font is not a problem or for whatever reason they want.

ivankokan commented 8 months ago

> @ivankokan All non-ASCII chars used in Croatian (č ć ǆ đ ǉ ǌ š ž Č Ć Ǆ ǅ Đ Ǉ ǈ Ǌ ǋ Š Ž) work with OT1 while hyphenation requires T1.

@gmilde @jbezos So, the issue with Croatian and OT1 is 99 % with the hyphenation (the other 1 % is about missing Đđ which is at least handled somehow)? I leave you two to decide whether the OT1 should be excluded or not (indifferent on this matter but generally tend to be strict :D).

gmilde commented 8 months ago

@ivankokan

> [...] missing Đđ which is at least handled somehow)?

Đđ are handled exactly like the other "adorned" characters: T1 has slots for pre-composed characters, while in OT1 they are created by superposition of the base character and adornment (haček, acute, stroke, ...). The same holds for, e.g., German umlauts (äöü) and French letters with grave and circumflex. The legacy ligatures are converted to two characters (like in Unicode) already by "inputenc" (cf. utf8enc.dfu).

This makes T1 the preferred font encoding for these languages and OT1 a "compatibility font encoding" (it works with some drawbacks).

gmilde commented 8 months ago

> If necessary, the preferred encoding for a language can be set by users, if the font is not a problem or for whatever reason they want.

Loading a font encoding in the document preamble can be interpreted as a statement that the document author wants to use this font encoding at some place in the document.

This is why I propose to switch the preferred font encoding for text parts in a "foreign" language if this font encoding is known. (Avoiding the font-encoding switch is as easy as deleting the respective font encoding from the list of "fontenc" arguments.)

> The real limitation is the selected encoding must render all characters (thorn, eth, ogonek, schwa, eng, etc.).

This is why I propose to switch to an "ersatz" font encoding if the preferred font encoding is not known and the current font encoding is not in the list of compatible font encodings. If no compatible font encoding is declared, write a warning and try with the current font encoding (maybe it works because the missing characters are not used or the document provides some other workaround). In case of a compilation error, the combination of the actual error message and the preceding warning can give the user sensible feedback.

latex3 / babel

8-bit font encodings in *.ini files #267