latex3 / babel

The multilingual framework to localize LaTeX, LuaLaTeX and XeLaTeX
https://latex3.github.io/babel/
LaTeX Project Public License v1.3c

Hyphenation in Greek orphans combining accents. #107

Closed. Davislor closed this issue 3 years ago.

Davislor commented 3 years ago

I’ll pull out the bug report from the middle of my feature request. Greek hyphenation only supports precomposed characters. If you try to use decomposed characters, it will happily insert a hyphenation point between a base character and its combining accents, leaving the accents orphaned in the margin of the next line.

\documentclass{article}
\usepackage[greek.ancient]{babel}
\usepackage{fontspec}

\babelfont{rm}[Ligatures={Common,Discretionary}]{Libertinus Serif}
\pagestyle{empty}

\begin{document}
\parbox{0pt}{
Hyphenate: % The first word in a paragraph is unhyphenated.
^^^^1faf^^^^03a9^^^^0314^^^^0342^^^^0345
}
\end{document}

(Screenshot “hyphtest”: the output, with the combining accents orphaned at the start of the next line.)

This violates the requirement of the Unicode standard that canonically-equivalent characters have the same display and behavior. Hyphenation should really be disabled before any Unicode combining character (at least by default).

Loading a different Greek definition file with \babelprovide[import=el-polyton]{greek} does not help.

There is a similar bug in Polyglossia, if it makes you feel any better!

jbezos commented 3 years ago

It seems related to https://github.com/latex3/babel/wiki/What's-new-in-babel-3.44#example-2-combining-chars . I'll investigate, but I'd say the problem is either in the engine or in the hyphenation patterns, not in babel. I'm not sure whether babel should deal with this issue, but as the link above shows it should be solvable with Lua. (By the way, I'm currently working on \babelprehyphenation, which can be very slow; expect a speed boost in the next release.)

u-fischer commented 3 years ago

I think it is a problem of the patterns, or at least that is what I thought four years ago: https://github.com/hyphenation/tex-hyphen/issues/5. And it fails with xelatex too, so a lua-only solution, while nice, imho isn't enough.

jbezos commented 3 years ago

However, I see a problem with treating this as a problem of the patterns: every pattern containing a letter that can carry a combining char would have to be repeated for each possible combination. I'm not sure, but I'd say this would lead to a combinatorial explosion. I think the best approach would be to normalize first, before the hyphenation pass, and adjust the patterns accordingly. Sadly, I think this is only feasible with luatex.
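
To make the concern concrete, here is an invented illustration in LuaLaTeX notation (these patterns are not taken from the real el-polyton files): the same rule has to be stated once for NFC and once for NFD input, and a letter carrying breathing, accent and iota subscript multiplies the decomposed variants further.

\patterns{
  ῶ1μα           % precomposed: matches NFC input only
  ω^^^^03421μα   % decomposed twin: omega + U+0342 combining perispomeni
}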

davidcarlisle commented 3 years ago

I think you can avoid the combinatorial explosion by just having a simple high no-hyphenation pattern for each of the combining characters, to prevent breaking before them. This may (will definitely) miss some patterns specific to the precomposed characters, but it would be better than allowing a break between the base and the combining accent, which I now see is also what Arthur suggests in the tex-hyphen issue that Ulrike links to above.
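
A minimal sketch of what such blocking patterns could look like, assuming LuaLaTeX (where patterns can still be added at run time) and assuming it is issued while the Greek language is current; nothing here comes from the real pattern files. In TeX patterns an even value forbids a break and higher values win, so a lone 8 before each combining character blocks hyphenation immediately before it.

\patterns{
  8^^^^0300 % combining grave (varia)
  8^^^^0301 % combining acute (oxia / tonos)
  8^^^^0304 % combining macron
  8^^^^0306 % combining breve (vrachy)
  8^^^^0308 % combining diaeresis (dialytika)
  8^^^^0313 % combining comma above (psili)
  8^^^^0314 % combining reversed comma above (dasia)
  8^^^^0342 % combining perispomeni
  8^^^^0345 % combining ypogegrammeni
}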

Davislor commented 3 years ago

It’d be very convenient to have LaTeX export a function that checks the Unicode character properties, or to enable something like the PCRE \pM syntax.

Failing that, the Unicode normalization tables list the following combining accents in the canonical decompositions of Greek characters:

U+0300 grave / varia
U+0301 acute / oxia / tonos
U+0304 macron
U+0306 breve / vrachy
U+0313 comma above / smooth breathing / psili
U+0314 reversed comma above / rough breathing / dasia
U+0342 circumflex / perispomeni
U+0345 ypogegrammeni / prosgegrammeni
U+0308 diaeresis / dialytika

In addition, the beta code chart lists some accents that are sometimes used to annotate ancient Greek texts, or at least can appear in documents converted from beta code:

U+0302 circumflex (for Latin letters)
U+0305 overline
U+0307 dot above
U+030C caron / hacek
U+031A treated as long
U+0323 dot below
U+0327 cedilla
U+0328 ogonek
U+032D circumflex below / inserted letter
U+032F inverted breve below
U+0332 underline
U+0333 double underline
U+0336 long stroke
U+033D x above
U+0359 asterisk below
U+035C double breve below
U+035D double breve
U+035E double macron
U+0361 double inverted breve
U+0485 Cyrillic dasia / archaic rough breathing
U+0486 Cyrillic psili / archaic smooth breathing
U+1DC0 dotted grave
U+1DC1 dotted acute
U+1D242 treated as short

Most of these are for editorial marks or bibliographies, but the following rare symbols from beta code contain combining characters and should not be broken apart:

U+03A1 U+0336: Rho with long stroke
U+039B U+0325: Lambda with ring below / drachma abbreviation
U+2016 U+0334: Double line with tilde / unknown abbreviation
U+006E U+030A: N with ring above / abbreviation
U+0375 U+0311: Greek lower numeral sign with breve / unknown abbreviation
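
A hypothetical sketch of how a list like the one above could be fed in wholesale, assuming LuaLaTeX and that the Greek language is current when it runs (nothing here comes from the real pattern files); it builds one blocking pattern per combining character, as in the sketch earlier in the thread.

\directlua{
  local accents = {
    0x0300, 0x0301, 0x0304, 0x0306, 0x0308, 0x0313, 0x0314, 0x0342, 0x0345,
    0x0302, 0x0305, 0x0307, 0x030C, 0x031A, 0x0323, 0x0327, 0x0328, 0x032D,
    0x032F, 0x0332, 0x0333, 0x0336, 0x033D, 0x0359, 0x035C, 0x035D, 0x035E,
    0x0361, 0x0485, 0x0486, 0x1DC0, 0x1DC1,
  }
  local pats = {}
  for _, cp in ipairs(accents) do
    -- an even value ("8") right before the character forbids a break there
    table.insert(pats, "8" .. utf8.char(cp))
  end
  -- add the patterns to the language currently selected by \string\language
  lang.patterns(lang.new(tex.language), table.concat(pats, " "))
}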

jbezos commented 3 years ago

@davidcarlisle

by just having a simple high no-hyphenation pattern for each of the combining characters to prevent breaking before them

Sure, these patterns are necessary.

This may (will definitely) miss some patterns

That's what worries me. In LGR scripts, good line breaking is essential for fine typography (and in the Unicode age, line breaking is not just about applying hyphenation patterns, which is only one piece of the localization puzzle).

jbezos commented 3 years ago

@Davislor Do you have some samples? As I've said to David, line breaking is not only about hyphenation patterns (which is the main reason \babelposthyphenation exists), and I would like to analyze whether this issue belongs to fonts or to localization.

Davislor commented 3 years ago

@jbezos I’m not sure what you’re requesting samples of. There is a large corpus of ancient Greek texts, in both Unicode and beta code, at Perseus.

jbezos commented 3 years ago

@Davislor A text with these accents. With lots of them, if possible.

Davislor commented 3 years ago

Sure, check out these texts.

Davislor commented 3 years ago

I also included a sample document and its output PDF in my other post.

Davislor commented 3 years ago

Note, however, that ancient Greek text on the Web will primarily, or entirely, use precomposed characters. You would need to convert them to NFD in order to test combining accents. (I have a Haskell program sitting around that does this.)
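
If converting the files is inconvenient, one lightweight way to exercise the decomposed path under XeLaTeX is to let the engine do the conversion at input time; a sketch only, not a fix for the underlying issue.

% Force NFD at input time so precomposed sources exercise the combining-accent path
% (0 = off, 1 = NFC, 2 = NFD):
\XeTeXinputnormalization=2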

car222222 commented 3 years ago

This may not be the right place to go into such details on this subject but:

@davidcarlisle wrote: "by just having a simple high no-hyphenation pattern for each of the combining characters to prevent breaking before them"

Has anyone actually tried adding the '8' patterns for all Unicode combining chars (for any language)? My understanding is that this will always work unless someone has perversely added some '9 patterns'! And I imagine that no such patterns would occur 'naturally'.

and: "This may (will definitely) miss some patterns"

Nothing will 'get missed' by adding the extra '8 patterns' as suggested above. The patterns that would match the decomposed forms may simply never have been there in the first place!

Obviously, the corresponding patterns are needed for both normalisations; but that is true with or without the addition of these necessary 'blocking-8 patterns'.

car222222 commented 3 years ago

@jbezos wrote: "line breaking is not only about hyphenation patterns".

Does this refer (mostly) to the "Unicode Line Breaking Algorithm"? Or are you considering other influences?

maieul commented 3 years ago

There is a primitive in the XeTeX engine which converts decomposed characters to precomposed on the fly (\XeTeXinputnormalization), and a package for luatex that does the same (https://github.com/michal-h21/uninormalize).
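
A minimal usage sketch of both routes; the uninormalize line is an assumption about that package's interface, and at this point the package was not yet on CTAN.

% XeLaTeX: normalize the input stream to NFC (precomposed) on the fly
\XeTeXinputnormalization=1
% LuaLaTeX: the uninormalize package performs a comparable normalization pass
% \usepackage{uninormalize}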

jbezos commented 3 years ago

@maieul Conversion at the input level is very likely the best option, but some things must be sorted out, like the ^^^^XXXX notation. Is uninormalize available on CTAN? I couldn't find it.

jbezos commented 3 years ago

@car222222 Mostly a Unicode-based algorithm, but we may need additional rules for specific cases. For example: https://tex.stackexchange.com/questions/554760/apply-lefthyphenmin-to-parts-of-a-word-spelled-with-hyphens/554788#554788 . See also https://github.com/latex3/babel/wiki/Non%E2%80%93standard-hyphenation-with-luatex .

maieul commented 3 years ago

@jbezos for CTAN, cf. https://github.com/michal-h21/uninormalize/issues/1

car222222 commented 3 years ago

@maieul @jbezos Note that normalisation to the precomposed form will not work in general because no new precomposed characters have been introduced for some time: 'no new letter combinations can be added'. The default is now that the decomposed string must be used.

car222222 commented 3 years ago

The 'x-word' problem is a very old one, and it is language independent I think?

jbezos commented 3 years ago

@car222222 It could be language dependent, if \lefthyphenmin is 1 (not usual, but not impossible either).

Davislor commented 3 years ago

@car222222 I believe that, for Greek specifically, the precomposed characters are reasonably comprehensive. In general for an arbitrary language, I agree that this is not an adequate workaround. At the moment, decomposed characters do not work and precomposed characters do, but it would be best to support both.

jbezos commented 3 years ago

@car222222 In most editors and OSs the preferred forms are the precomposed ones, with some exceptions in a few scripts. And often there is no choice. Anyway, according to the Unicode guidelines both are valid and therefore should be supported somehow.

@Davislor Since transformations are ‘mechanical’ and not language-dependent, and there is at least a solution (thank you @maieul!), even if not yet on CTAN, I think this issue can be closed. But I'll keep it open in my mind, because I have dealt with this problem myself.

maieul commented 3 years ago

@jbezos the input-normalisation tool for lualatex has now been published on CTAN