sile-typesetter / sile

The SILE Typesetter — Simon’s Improved Layout Engine
https://sile-typesetter.org
MIT License

Hyphenation on words joined with hyphen in Polish #1960

Closed Omikhleia closed 8 months ago

Omikhleia commented 8 months ago

Rumour has it that in Polish, when a word containing a hyphen is cut at that point, the hyphen must be repeated on the next line -- and that several typesetting systems have unaddressed enhancement requests on that topic:

Typst (issue in 2024): https://github.com/typst/typst/issues/3235
OpenOffice (issue in 2006): https://bz.apache.org/ooo/show_bug.cgi?id=71679

If this typography convention for Polish is correct, then SILE too is currently wrong:

\begin[papersize=a5]{document}
\language[main=pl]
\lua{
    for i = 1,100 do
       SILE.typesetter:typeset("biało-czerwony ")
    end
}
\end{document}

[image: typeset output of the example above]

Except that it would be a fairly trivial thing to fix... Just a few lines of code, possibly: (EDIT: Well, a bit more, see PR)

-- Put this in languages/pl.lua
SILE.nodeMakers.pl = pl.class(SILE.nodeMakers.unicode)

function SILE.nodeMakers.pl:handleWordBreak (item)
  if item.text == "-" then
    self:addToken(item.text, item)
    self:makeToken()
    coroutine.yield(SILE.nodefactory.discretionary({
      postbreak = SILE.shaper:createNnodes("-", self.options)
    }))
  else
    self._base.handleWordBreak(self, item)
  end
end

[image: typeset output with the fix applied]

Yay! :smile:


I could make a PR, but the devil is in the details. Here I stack the "-" onto the previous word, and insert a postbreak discretionary. But we could also stack the "-" onto the next word, or even drop it entirely (of course, each time using an adequate postbreak/prebreak/replacement discretionary).
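To make the three options concrete, here is a minimal sketch in plain Lua. It is a toy model, not SILE code: the `discretionary` and `render` helpers are hypothetical names, and the tables only imitate the idea of a node with prebreak/postbreak/replacement content. It suggests that all three placements can give the same visible output, so the real difference is which word the "-" is attached to when the hyphenation patterns are applied.

```lua
-- Toy model only: these tables imitate the idea of a discretionary node;
-- they are NOT SILE node objects.
local function discretionary (fields)
  return {
    prebreak = fields.prebreak or "",       -- ends the line when the break is taken
    postbreak = fields.postbreak or "",     -- starts the next line when the break is taken
    replacement = fields.replacement or "", -- used when the break is not taken
  }
end

-- Render a node list as a list of lines. breakAt is the index of the
-- discretionary taken as a line break, or nil for no break at all.
local function render (nodes, breakAt)
  local lines, current = {}, ""
  for i, node in ipairs(nodes) do
    if type(node) == "string" then
      current = current .. node
    elseif i == breakAt then
      lines[#lines + 1] = current .. node.prebreak
      current = node.postbreak
    else
      current = current .. node.replacement
    end
  end
  lines[#lines + 1] = current
  return lines
end

local variants = {
  { name = "hyphen on previous word",
    nodes = { "biało-", discretionary({ postbreak = "-" }), "czerwony" } },
  { name = "hyphen on next word",
    nodes = { "biało", discretionary({ prebreak = "-" }), "-czerwony" } },
  { name = "hyphen only in the discretionary",
    nodes = { "biało", discretionary({ prebreak = "-", postbreak = "-", replacement = "-" }), "czerwony" } },
}

for _, variant in ipairs(variants) do
  local unbroken = table.concat(render(variant.nodes), "")
  local broken = table.concat(render(variant.nodes, 2), " / ")
  print(variant.name .. ": " .. unbroken .. " | " .. broken)
end
```

In this model every variant renders as biało-czerwony when unbroken and as biało- / -czerwony when broken at the hyphen.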

The question at stake is: how is the first word supposed to be hyphenated?

With the above fix, we get bia•ło-•czer•wony (I am marking the hyphenation points with • to distinguish them from the word's hyphen) ... because SILE.typesetter:typeset(SILE.showHyphenationPoints("biało-", "pl")) :arrow_right: bia•ło-

But note that SILE.showHyphenationPoints("biało", "pl") :arrow_right: biało (EDIT: no hyphenation point currently) ... so depending on how we do it, we can get different hyphenations...

It seems to me that bia•ło-•czer•wony might be correct, but we'd need a Polish friend to confirm the expectations :poland:

EDIT: I corrected SILE.showHyphenationPoints("czerwony", "pl") :arrow_right: czer•wony (not czer•wo•ny) with default settings.

Omikhleia commented 8 months ago

Perhaps it's not too impolite to ask @jakubkaczor (who opened the above-mentioned Typst issue) on that matter?

Omikhleia commented 8 months ago

> If this typography convention for Polish is correct,

The 2022 manual for Polish in (LaTeX) Babel also mentions it: "According to Polish rules, when a break occurs at an explicit hyphen, the hyphen gets repeated at the beginning of the new line."

(Of course they "fix" it by requiring the user to typeset some specific markup, with active "catcodes"... Sigh.)

jakubkaczor commented 8 months ago

It is not impolite at all to ask me. If I understood the question correctly, you wonder whether there should be any hyphenation points in the parts of a word joined by a hyphen, if one occurs. I am not knowledgeable enough on the topic, but I can link some sources.

I believe the most common package for correcting hyphenation in LaTeX, and the one I used, is the polski package. The author provides commands for the hyphen (dywiz), en-dash (ppauza), and em-dash (pauza). These are explained in the (English!) documentation for the package. As far as I understand it, the author uses \kern to allow hyphenation in the words before and after the hyphen, so the answer would be: yes, the first word should also be considered for hyphenation. Please correct me if I am misunderstanding. As far as I am informed, the active author is a professional typographer, a member of GUST, and one of the authors of the TeX Gyre fonts.

Omikhleia commented 8 months ago

@jakubkaczor Many thanks! Indeed on p. 11 of the document you mentioned: "... allow both parts of the word to be considered for hyphenation"

That answers my question. (I can't understand the 0-valued kerns in TeX, but whatever, the conclusion is the key.)
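Putting the two findings together (repeat the hyphen when breaking at an explicit hyphen, and hyphenate both parts of the word), the candidate breaks can be enumerated with a small plain-Lua sketch. This is only an illustration under those assumptions: `split` and `candidateBreaks` are hypothetical helpers, not SILE functions, and the input is a string already marked with • as in the comments above.

```lua
-- Split a string on a literal (non-pattern) separator.
local function split (s, sep)
  local parts, start = {}, 1
  while true do
    local i, j = s:find(sep, start, true)
    if not i then
      parts[#parts + 1] = s:sub(start)
      break
    end
    parts[#parts + 1] = s:sub(start, i - 1)
    start = j + 1
  end
  return parts
end

-- For each hyphenation point marked with "•", return the pair
-- { end of first line, start of next line }.
local function candidateBreaks (marked)
  local parts = split(marked, "•")
  local result = {}
  for i = 1, #parts - 1 do
    local head = table.concat(parts, "", 1, i)
    local tail = table.concat(parts, "", i + 1)
    if head:sub(-1) == "-" then
      -- Break at the explicit hyphen: keep it, and repeat it on the next line.
      result[#result + 1] = { head, "-" .. tail }
    else
      -- Ordinary hyphenation point: a hyphen is added at the end of the line only.
      result[#result + 1] = { head .. "-", tail }
    end
  end
  return result
end

for _, pair in ipairs(candidateBreaks("bia•ło-•czer•wony")) do
  print(pair[1] .. " / " .. pair[2])
end
```

For the marked word above, this yields bia- / ło-czerwony, biało- / -czerwony and biało-czer- / wony.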

Omikhleia commented 8 months ago

Apparently also in Czech

alerque commented 8 months ago

Thanks for looking into this @Omikhleia, and thanks for the feedback @jakubkaczor. This should be working properly in the next patch release. It might even be worth adding an example to the website to showcase this. I could then add a Turkish example too, so we have some samples of how atypical hyphenation rules are or can be handled.