sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

Hyphenation minimal word length and casing are not UTF8-compliant #2018

Closed: Omikhleia closed this issue 1 month ago

Omikhleia commented 2 months ago

Issue

Related to #2017, which concerns the hard-coded minWord = 5 value, but this is a different kind of issue:

The logic is not UTF-8 compliant:

https://github.com/sile-typesetter/sile/blob/b2cc0841ff603abc335c5e66d8cc3c64b65365eb/core/hyphenator-liang.lua#L58-L63

Proofs / Minimal examples

In the second call below, with minWord set to 6, "léris" would be expected not to be hyphenated:

> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> SILE._hyphenators["fr"].minWord
5
> SILE._hyphenators["fr"].minWord = 6
> SILE.showHyphenationPoints("léris", "fr")
lé-ris
> -- OOPS. "léris" is 5 characters long (but 6 bytes long)
> SILE._hyphenators["fr"].minWord = 7
> SILE.showHyphenationPoints("léris", "fr")
léris
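
For illustration, here is a minimal sketch of the byte/character mismatch seen above. It assumes the minimum length check relies on Lua's `#` operator (or `string.len`), which counts bytes. The `utf8_length` helper is hypothetical and not part of SILE; it counts code points in valid UTF-8 by skipping continuation bytes, and it uses plain Lua so that it runs on any Lua version.

```lua
-- Hypothetical helper (not SILE's API): count UTF-8 code points by
-- ignoring continuation bytes (10xxxxxx). Assumes valid UTF-8 input.
local function utf8_length (s)
  local count = 0
  for i = 1, #s do
    local byte = s:byte(i)
    if byte < 0x80 or byte >= 0xC0 then
      count = count + 1
    end
  end
  return count
end

-- "léris", with "é" spelled out as its two UTF-8 bytes (0xC3 0xA9)
local word = "l\195\169ris"

print(#word)             -- 6 (bytes)
print(utf8_length(word)) -- 5 (code points)
```

Comparing minWord against the code point count rather than the byte count would leave "léris" unhyphenated once minWord is 6.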

Below, we override the patterns with an exception, but the exception is bypassed when the input is uppercase:

> SILE.call("hyphenator:add-exceptions", { lang="fr" }, { "légè-rement" }) -- Override as exception
> SILE.showHyphenationPoints("légèrement", "fr")
légè-rement
> SILE.showHyphenationPoints("LÉGÈREMENT", "fr")
LÉGÈ-RE-MENT
> -- OOPS, expected "LÉGÈ-REMENT"
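
One plausible explanation, assuming the exception lookup lowercases the word with Lua's `string.lower`: that function works byte by byte according to the C locale, so the UTF-8 bytes of "É" and "È" are left as they are and the lowercased key never matches the stored exception. The sketch below illustrates this. The `sample_lowercase` table and `sample_lower` function are hypothetical and cover only the two letters in this example; an actual fix would need a complete Unicode-aware case mapping.

```lua
-- "LÉGÈREMENT" and "légèrement", accented letters spelled out as UTF-8 bytes
local upper = "L\195\137G\195\136REMENT"
local lower = "l\195\169g\195\168rement"

-- Byte-wise lowercasing: expected to print false in the default "C" locale,
-- because the bytes of "É" and "È" are not touched.
print(string.lower(upper) == lower)

-- Illustrative mapping for this example only (not a full case mapping)
local sample_lowercase = {
  ["\195\137"] = "\195\169", -- É -> é
  ["\195\136"] = "\195\168", -- È -> è
}

-- Hypothetical helper: lowercase ASCII letters, then map the two-byte
-- sequences listed above; unknown sequences are kept unchanged.
local function sample_lower (s)
  local ascii_lowered = s:gsub("[A-Z]", function (c)
    return string.char(c:byte() + 32)
  end)
  return (ascii_lowered:gsub("\195[\128-\191]", sample_lowercase))
end

-- Exceptions keyed by the lowercase word, as a stand-in for the real lookup
local exceptions = { [lower] = "l\195\169g\195\168-rement" }

print(exceptions[string.lower(upper)] ~= nil) -- exception missed when lowercasing is byte-wise
print(exceptions[sample_lower(upper)] ~= nil) -- exception found with UTF-8-aware lowercasing
```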