Closed Davislor closed 3 years ago
It seems related to https://github.com/latex3/babel/wiki/What's-new-in-babel-3.44#example-2-combining-chars . I'll investigate but I'd say the problem is either in the engine or in the hyphenation patterns, not in babel. I'm not sure if babel must deal with this issue (but as the link above shows, it should be solvable with lua; by the way, currently I'm working on \babelprehyphenation which can be very slow, and expect a speed boost in the next release).
I think it is a problem of the patterns, at least I thought it four years ago: https://github.com/hyphenation/tex-hyphen/issues/5. And it fails with xelatex too, so a lua-only solution while nice imho isn't enough.
However, I see a problem if we consider it's a problem of the patterns, because we must repeat each pattern including a potential combining char for each possible combination. I'm not sure, but I'd say this would lead to a combinatorial explosion. I think the best approach would be to first normalize, before the hyphenation pass, and adjust patterns accordingly. Sadly, I think this is only feasible with luatex.
I think you can avoid combinatorial explosion by just having simple high no-hyphenation pattern for each of the combining characters to prevent breaking before them. This may (will definitely) miss some patterns specific to the precomposed characters but would be better than allowing break between the base and the combining accent. Which I now see is what Arthur also suggests at the tex-hyphen issue that Ulrike links to above.
It’d be very convenient to have LaTeX export a function that checks the Unicode character properties, or to enable something like the PCRE \pM syntax.
Failing that, the Unicode normalization tables list the following accents used in Greek with canonically-normalized decomposition:
U+0300 grave / varia
U+0301 acute / oxia / tonos
U+0304 macron
U+0306 breve / vrachy
U+0313 comma above / smooth breathing / psili
U+0314 reversed comma above / rough breathing / dasia
U+0342 circumflex / perispomeni
U+0345 ypogegrameni / prosgegrameni
U+0385 dieresis / dialytika
In addition, the beta code chart lists some accents that are sometimes used to annotate ancient Greek texts, or at least can appear in documents converted from beta code:
U+0302 circumflex (for Latin letters)
U+0305 overline
U+0307 dot above
U+030C caret / hacek
U+031A treated as long
U+0323 dot below
U+0327 cedilla
U+0328 ogonek
U+032D circumflex below / inserted letter
U+032F inverted breve below
U+0332 underline
U+0333 double underline
U+0336 long stroke
U+033D x above
U+0359 asterisk below
U+035C double breve below
U+035D double breve
U+035E double macron
U+0361 double inverted breve
U+0485 Cyrillic dasia / archaic rough breathing
U+0486 Cyrillic psili / archaic smooth breathing
U+1DC0 dotted grave
U+1DC1 dotted acute
U+1D242 treated as short
Most of these are for editorial marks or bibliographies, but the following rare symbols from beta code contain combining characters and should not be broken apart:
U+03A1 U+0336: Rho with long stroke
U+039B U+0325: Lambda with ring below / drachma abbreviation
U+2016 U+0334: Double line with tilde / unknown abbreviation
U+006E U+030A: N with ring above / abbreviation
U+0375 U+0311: Greek lower numeral sign with breve / unknown abbreviation
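A property check of the kind wished for above can be sketched outside TeX. The snippet below is only an illustration in Python, using the standard `unicodedata` module rather than any LaTeX interface. `is_mark` approximates the PCRE \pM test, and `greek_decomposition_marks` (a hypothetical helper name) collects the combining marks that occur in canonical decompositions of the Greek and Greek Extended blocks, so that such a list does not have to be maintained by hand.

```python
import unicodedata


def is_mark(ch: str) -> bool:
    """Rough equivalent of PCRE \\pM: general category Mn, Mc, or Me."""
    return unicodedata.category(ch).startswith("M")


def greek_decomposition_marks() -> set:
    """Collect marks found in the NFD forms of the Greek (U+0370..U+03FF)
    and Greek Extended (U+1F00..U+1FFF) blocks."""
    marks = set()
    for cp in list(range(0x0370, 0x0400)) + list(range(0x1F00, 0x2000)):
        for ch in unicodedata.normalize("NFD", chr(cp)):
            if is_mark(ch):
                marks.add(ord(ch))
    return marks


if __name__ == "__main__":
    # Print each mark with its official Unicode name.
    for cp in sorted(greek_decomposition_marks()):
        print(f"U+{cp:04X} {unicodedata.name(chr(cp))}")
```

With this check, U+0385 is not reported as a mark, because its general category is Sk (a spacing symbol); the combining dialytika that shows up in the decompositions is U+0308.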
@davidcarlisle wrote: "by just having simple high no-hyphenation pattern for each of the combining characters to prevent breaking before them"
Sure, these patterns are necessary.
And: "This may (will definitely) miss some patterns"
That's what worries me. In LGR scripts good line breaking is essential for fine typography (and in the Unicode age line breaking is not just about applying hyphenation patterns, which is only one piece of the puzzle in localization issues).
@Davislor Do you have some samples? As I've said to David, line breaking is not only about hyphenation patterns (which is the main reason \babelposthyphenation exists) and I would like to analyze if this issue belongs to fonts or to localization.
@jbezos I’m not sure what you’re requesting samples of. There is a large corpus of ancient Greek texts, in both Unicode and beta code, at Perseus.
@Davislor A text with these accents. With lots of them, if possible.
Sure, check out these texts.
I also included a sample document and its output PDF in my other post.
Note, however, that ancient Greek text on the Web will primarily, or entirely, use precomposed characters. You would need to convert them to NFD in order to test combining accents. (I have a Haskell program sitting around that does this.)
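For anyone who wants to build such test input, the conversion to NFD takes only a few lines. This is a minimal sketch in Python (not the Haskell program mentioned above), and the helper names `to_nfd` and `describe` are made up for illustration.

```python
import unicodedata


def to_nfd(text: str) -> str:
    """Return the canonically decomposed (NFD) form of text."""
    return unicodedata.normalize("NFD", text)


def describe(text: str) -> list:
    """List code points as U+XXXX strings, handy for inspecting test input."""
    return [f"U+{ord(ch):04X}" for ch in text]


if __name__ == "__main__":
    # "anthropos" written with a precomposed alpha with psili and oxia (U+1F04)
    word = "\u1f04\u03bd\u03b8\u03c1\u03c9\u03c0\u03bf\u03c2"
    print(describe(to_nfd(word)))
```

After conversion the initial U+1F04 becomes the sequence U+03B1 U+0313 U+0301, which is the kind of input that triggers the bad break.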
This may not be the right place to go into such details on this subject but:
@davidcarlisle wrote: "by just having simple high no-hyphenation pattern for each of the combining characters to prevent breaking before them"
Has anyone actually tried adding the patterns '8
and: "This may (will definitely) miss some patterns"
Nothing will 'get missed' by adding the extra '8 patterns' as suggested above. But they may never have been there!
Obviously, the corresponding patterns are needed for both normalisations; but that is true with or without the addition of these necessary 'blocking-8 patterns'.
@jbezos wrote: "line breaking is not only about hyphenation patterns".
Does this refer (mostly) to the "Unicode Line Breaking Algorithm"? Or are you considering other influences?
There is a macro in the XeTeX engine which converts decomposed characters to precomposed "on the fly" (\XeTeXinputnormalization), and a package for luatex doing the same (https://github.com/michal-h21/uninormalize).
@maieul Conversion at the input level is very likely the best option, but some things must be sorted out, like the ^^^^XXXX notation. Is uninormalize available on CTAN? I couldn't find it.
@car222222 Mostly a Unicode-based algorithm, but we may need additional rules for specific cases. For example: https://tex.stackexchange.com/questions/554760/apply-lefthyphenmin-to-parts-of-a-word-spelled-with-hyphens/554788#554788 . See also https://github.com/latex3/babel/wiki/Non%E2%80%93standard-hyphenation-with-luatex .
@jbezos for ctan, cf https://github.com/michal-h21/uninormalize/issues/1
@maieul @jbezos Note that normalisation to the precomposed form will not work in general because no new precomposed characters have been introduced for some time: 'no new letter combinations can be added'. The default is now that the decomposed string must be used.
The 'x-word' problem is a very old one, and it is language independent I think?
@car222222 It could be language dependent, if the left hyphenmin is 1 (not usual, but not impossible either).
@car222222 I believe that, for Greek specifically, the precomposed characters are reasonably comprehensive. In general for an arbitrary language, I agree that this is not an adequate workaround. At the moment, decomposed characters do not work and precomposed characters do, but it would be best to support both.
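The limits of normalising to precomposed form are easy to see in a small experiment. The following Python sketch (the function name `marks_left_after_nfc` is hypothetical) applies NFC and reports which combining marks survive because no precomposed character exists for them.

```python
import unicodedata


def marks_left_after_nfc(text: str) -> list:
    """Return the combining marks that NFC could not fold into a
    precomposed character."""
    composed = unicodedata.normalize("NFC", text)
    return [ch for ch in composed if unicodedata.category(ch).startswith("M")]


if __name__ == "__main__":
    # alpha + acute composes fully; alpha + macron + acute does not.
    for sample in ("\u03b1\u0301", "\u03b1\u0304\u0301"):
        left = [f"U+{ord(ch):04X}" for ch in marks_left_after_nfc(sample)]
        print(left)
```

Alpha with acute composes completely, while alpha with macron and acute keeps a trailing U+0301 after NFC, so a hyphenation step that only understands precomposed characters can still see a bare combining mark even in Greek.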
@car222222 In most editors and OSs the preferred forms are the precomposed ones, with some exceptions in a few scripts. And often there is no choice. Anyway, according to the Unicode guidelines both are valid and therefore should be supported somehow.
@Davislor Since transformations are ‘mechanical’ and not language dependent, and there is at least a solution (thank you @maieul !), even if not in CTAN, I think this issue can be closed. But I'll keep it open in my mind, because I have dealt with this problem myself.
@jbezos the input-normalisation tools for lualatex were published on CTAN
I’ll pull out the bug report from the middle of my feature request. Greek hyphenation only supports precomposed characters. If you try to use decomposed characters, it will happily insert a hyphenation point between a base character and its combining accents, leaving the accents orphaned in the margin of the next line.
This violates the requirement of the Unicode standard that canonically-equivalent characters have the same display and behavior. Hyphenation should really be disabled before any Unicode combining character (at least by default).
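The requested default can be stated as a simple filter over candidate break positions. The Python sketch below is not how any TeX engine or babel implements hyphenation; `safe_break_points` and its `candidates` argument are hypothetical, standing in for whatever positions a pattern matcher would propose.

```python
import unicodedata


def safe_break_points(word: str, candidates: list) -> list:
    """Drop candidate break indices that would separate a base character
    from a following combining mark. Index i means "break before word[i]"."""
    return [
        i for i in candidates
        if 0 < i < len(word)
        and not unicodedata.category(word[i]).startswith("M")
    ]


if __name__ == "__main__":
    # alpha + psili + oxia + nu, in decomposed form
    word = "\u03b1\u0313\u0301\u03bd"
    print(safe_break_points(word, [1, 2, 3]))
```

Positions 1 and 2 fall before combining marks and are rejected; only the break before the nu survives.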
Loading a different Greek definition file with
\babelprovide[import=el-polyton]{greek}
does not help.
There is a similar bug in Polyglossia, if it makes you feel any better!