hyphenation / tex-hyphen

Hyphenation patterns for TeX

Dynamically load 8-bit patterns into LuaLaTeX #7

Closed sgolovan closed 7 years ago

sgolovan commented 7 years ago

Hi!

There's a use case for the hyphenation patterns loader which isn't covered by the current code.

Sometimes I want to compile a legacy document using LuaLaTeX. The document (it's usually in Russian) uses the T2A font encoding and some input encoding (cp1251 or utf-8). After replacing inputenc with luainputenc, the relevant part of the preamble becomes the following:

\usepackage[T2A]{fontenc}
\usepackage[utf8]{luainputenc}
\usepackage[russian]{babel}

The main problem with this setup is that babel loads the Russian hyphenation patterns dynamically through the usual language.dat, which in turn makes it source loadhyph-ru.tex. Since the TeX engine is Unicode-aware, the loader then reads hyph-ru.tex in UTF-8 encoding, so hyphenation is effectively switched off: Russian letters in T2A encoding reside in different slots than the ones the UTF-8 patterns refer to.
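
A minimal test file along these lines should make the symptom visible. It is only a sketch: it assumes that luainputenc, T2A-encoded Cyrillic fonts and the Russian babel support are installed, and the two sample words are arbitrary. The narrow \parbox forces line breaks inside the words whenever hyphenation points are available.

```latex
% Sketch of a test document; compile with lualatex.
\documentclass{article}
\usepackage[T2A]{fontenc}
\usepackage[utf8]{luainputenc}
\usepackage[russian]{babel}
\begin{document}
% \hspace{0pt} lets classic TeX engines hyphenate the first word as well.
\parbox{1.5cm}{\hspace{0pt}электрификация достопримечательность}
\end{document}
```

If the words overflow the box instead of being broken, the loaded patterns most likely do not match the font encoding.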

Locally, I use the following customized loadhyph-ru.tex. It checks whether the default font encoding (if it's set by fontenc.sty) is in the list of Russian encodings and, if so, loads the T2A patterns (initially designed for pTeX):

% filename: loadhyph-ru.tex
% language: russian
%
% Loader for hyphenation patterns, generated by
%     source/generic/hyph-utf8/generate-pattern-loaders.rb
% See also http://tug.org/tex-hyphen
%
% Copyright 2008-2016 TeX Users Group.
% You may freely use, modify and/or distribute this file.
% (But consider adapting the scripts if you need modifications.)
%
% Once it turns out that more than a simple definition is needed,
% these lines may be moved to a separate file.
%
\begingroup
\lccode`\-=`\-
% Test for pTeX
\ifx\kanjiskip\undefined
% Test for native UTF-8 (which gets only a single argument)
% That's Tau (as in Taco or ΤΕΧ, Tau-Epsilon-Chi), a 2-byte UTF-8 character
\def\testengine#1#2!{\def\secondarg{#2}}\testengine Τ!\relax
\ifx\secondarg\empty
    % Unicode-aware engine (such as XeTeX or LuaTeX) only sees a single (2-byte) argument
    \catcode`\@=11
    \ifx\in@\undefined
        % initex
        \let\unicode\relax
    \else
        % LuaLaTeX is loading patterns dynamically
        \ifx\encodingdefault\undefined
            % Font encoding isn't specified
            \let\unicode\relax
        \else
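            \expandafter\in@\expandafter{\encodingdefault}{T2A,T2B,T2C,X2}%
            % Compare \ifin@ with \iftrue via \csname instead of using \ifin@ directly:
            % when TeX skips this branch and \ifin@ is undefined (no LaTeX) or @ is not
            % a letter, a bare \ifin@ would not be counted and \else/\fi would misnest.
            \expandafter\ifx\csname ifin@\expandafter\endcsname\csname iftrue\endcsname
                % Default font encoding is 8-bit, so load t2a patterns
            \else
                \let\unicode\relax
            \fi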
        \fi
    \fi
    \catcode`\@=12
    \ifx\unicode\relax
        \message{UTF-8 Russian hyphenation patterns}
        \input hyph-ru.tex
    \else
        \message{T2A Russian hyphenation patterns}
        \input hyph-ru.t2a.tex
    \fi
\else
    % 8-bit engine (such as TeX or pdfTeX)
    \message{T2A Russian hyphenation patterns}
    % The old system allows choosing patterns and encodings manually. That mechanism needs to be implemented first in this package, so we still fall back on old system.
    \input ruhyphen.tex
\fi\else
    % pTeX
    \message{T2A Russian hyphenation patterns}
    \input hyph-ru.t2a.tex
\fi
\endgroup

So I'd like to ask if this approach makes sense, and if it could be done for all the hyphenation loaders to make things work without local changes. Or maybe there's some other way.
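
The encoding test at the heart of this loader can also be tried in isolation. The following is a minimal sketch, assuming a LaTeX format (so that \in@ and \ifin@ exist) and installed T2A fonts; the \typeout messages are purely illustrative and not part of any loader.

```latex
% Sketch: report whether the default font encoding is in the Cyrillic list.
\documentclass{article}
\usepackage[T2A]{fontenc}
\makeatletter
% Expand \encodingdefault once before \in@ searches for it in the list.
\expandafter\in@\expandafter{\encodingdefault}{T2A,T2B,T2C,X2}
\ifin@
  \typeout{Default encoding \encodingdefault\space is in the 8-bit Cyrillic list}
\else
  \typeout{Default encoding \encodingdefault\space is not in the 8-bit Cyrillic list}
\fi
\makeatother
\begin{document}
Check the log.
\end{document}
```

Removing the fontenc line should switch the message to the second variant, since the default encoding is then whatever the format sets.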

mojca commented 7 years ago

I would suggest raising this question on a mailing list: either tex-hyphen, kadingira, or some LuaLaTeX list. Arthur knows more, but I'm illiterate in LaTeX, and some feedback from the Babel developer would be welcome.

But I'm slightly confused. Doesn't LuaLaTeX use a different mechanism for loading the patterns? I thought it used plain text patterns directly. Then again, I did not follow the recent changes in Babel too closely.

sgolovan commented 7 years ago

Afaik, LuaLaTeX can load patterns both ways, via a Lua hook or the old way, sourcing a file with hyphenation patterns in it. Anyway, someone has to tell it which patterns to use for a given language. And I know two implementations: 1) polyglossia makes LuaTeX use the text patterns (hyph-ru.pat.txt for Russian), 2) Babel just loads loadhyph-ru.tex.

Legacy documents don't use Polyglossia, so only Babel needs both UTF-8 and 8-bit patterns.

I'll ask this question on a mailing list, thank you for the suggestion.

reutenauer commented 7 years ago

This is a Babel issue since, as you’re aware, no patterns are dumped into the LuaLaTeX format (except for hyphen.tex), and it’s thus up to packages to decide what to do. Javier chose to use language.dat directly, without explaining why, and I think it would be good to discuss it with him. The Kadingira list is probably the best place.

reutenauer commented 7 years ago

Closing the issue.

mojca commented 7 years ago

Indeed. Please ask the author of Babel or ask for help on Stack Exchange. I know that XeTeX had a mechanism to map old fonts to the proper Unicode slots, which is much cleaner in the end; ConTeXt did something similar in the early days of Unicode. The patterns end would be the wrong place for the fix, also because we don't know when the user might change the font and we don't really know the desired encoding. This needs proper support on the LaTeX/Babel end.