typst / typst

A new markup-based typesetting system that is powerful and easy to learn.
https://typst.app
Apache License 2.0

Remove apostrophes when hyphenating Turkish #2580

Open alerque opened 10 months ago

alerque commented 10 months ago

Turkish has an interesting feature: apostrophes inside words (which are very common) are valid break points, but the correct way to break at one is to remove the apostrophe and put the hyphen in its place.
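As a rough illustration of that rule, here is a minimal Python sketch. The function name `break_turkish_word` and the set of characters treated as apostrophes are assumptions made for this example; this is not how Typst or SILE implement it.

```python
# Characters treated as apostrophes here (an assumption for this sketch):
# the ASCII apostrophe and the right single quotation mark.
APOSTROPHES = {"'", "\u2019"}


def break_turkish_word(word: str, index: int) -> tuple[str, str]:
    """Split `word` at `index` and return (end of line, start of next line).

    If the character at `index` is an apostrophe, it is dropped and the
    hyphen takes its place. Otherwise a hyphen is simply appended to the
    first fragment, as in ordinary hyphenation.
    """
    if not 0 < index < len(word):
        raise ValueError("break index must fall strictly inside the word")
    if word[index] in APOSTROPHES:
        # Apostrophe break: remove the apostrophe, end the line with a hyphen.
        return word[:index] + "-", word[index + 1:]
    # Ordinary break: keep every character, end the line with a hyphen.
    return word[:index] + "-", word[index:]


word = "İstanbul'dan"
print(break_turkish_word(word, word.index("'")))
```

The same word set unbroken keeps its apostrophe; only the broken form loses it.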

For cross-reference, here is the discussion of the same issue, which was raised a few years back and solved in SILE.

Here is the torture test case adapted from SILE's test/bug-355:

#set page(
  paper: "a7",
  margin: (
    x: 30pt,
  )
)
#set par(
  justify: true,
)
#set text(
  lang: "tr",
  font: "Gentium Plus",
  size: 10pt,
)

// Some awkward apostrophe situations. Real words here but not real grammar.
// Margins chosen to hit as many unique awkward break points as possible
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.

This sample has 6 bad breaks in Typst v0.9: 3 trailing apostrophes and 3 leading ones. Here is a screenshot in case you don't have the font on hand to reproduce the metrics:

[Screenshot: 20231103_22h42m40s_grim]
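For anyone who wants to count such breaks mechanically, here is a small Python sketch. It assumes the laid-out lines are available as plain strings (for example from text extraction of the PDF), which is an assumption of this example, and the helper name `find_bad_breaks` is hypothetical. It follows the convention described in this issue, where an apostrophe left at either side of a break counts as bad.

```python
# Characters treated as apostrophes (an assumption for this sketch).
APOSTROPHES = ("'", "\u2019")
# Hyphen characters that may follow a stranded apostrophe at a line end.
HYPHENS = "-\u2010"


def find_bad_breaks(lines):
    """Return (line number, kind) pairs for apostrophes stranded at a break.

    "leading" means the line starts with an apostrophe; "trailing" means it
    ends with one, optionally followed by a hyphen.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        text = line.strip()
        if not text:
            continue
        if text.startswith(APOSTROPHES):
            problems.append((number, "leading"))
        # Strip a final hyphen so both "İsa'" and "İsa'-" are caught.
        if text.rstrip(HYPHENS).endswith(APOSTROPHES):
            problems.append((number, "trailing"))
    return problems


# Made-up line breaks for illustration, not the actual Typst output.
sample = ["İstanbul'dan İsa'", "ya İzmir'de O-", "'nu Müjdesi-", "nin İsa'nın"]
print(find_bad_breaks(sample))
```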

Here is source for the same page in SILE:

\begin[papersize=a7]{document}
\nofolios
\neverindent
\set[parameter=shaper.variablespaces, value=true]
\set[parameter=linebreak.emergencyStretch,value=6pt]
\language[main=tr]

% Some awkward apostrophe situations. Real words here but not real grammar.
% Margins chosen to hit as many unique awkward break points as possible
\set[parameter=document.lskip,value=9.6%pw]
\set[parameter=document.rskip,value=9.6%pw]

İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
İstanbul'dan İsa'ya İzmir'de O'nu Müjdesi'nin İsa'nın Afyonkarahisar'danmış Can'a.
\end{document}

And here is what it looks like, showing at least 4 valid cases of apostrophe replacement:

[Screenshot: 20231103_22h43m46s_grim]

This occurred to me because of #2579. It affects the internal mechanics of how word tokenizing, hyphenation points, and the text that is actually output relate to each other.
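One way to picture those mechanics borrows the idea of a TeX-style discretionary: each break opportunity carries the text to emit when the line does not break there, plus the text to emit before and after the break when it does. The Python sketch below is only a model of that idea; the names `BreakPoint`, `break_points`, and `render` are hypothetical, and the pattern-based hyphenation indices are assumed to come from elsewhere.

```python
from dataclasses import dataclass

# Characters treated as apostrophes (an assumption for this sketch).
APOSTROPHES = {"'", "\u2019"}


@dataclass(frozen=True)
class BreakPoint:
    index: int       # position in the word where the opportunity sits
    no_break: str    # text consumed from the word, emitted when not breaking
    pre_break: str   # text emitted at the end of the line when breaking
    post_break: str  # text emitted at the start of the next line


def break_points(word, hyphenation_indices):
    """Merge pattern-based hyphenation points with apostrophe points."""
    points = []
    for i, ch in enumerate(word):
        if ch in APOSTROPHES and 0 < i < len(word) - 1:
            # The apostrophe is the unbroken form; a hyphen replaces it.
            points.append(BreakPoint(i, ch, "-", ""))
        elif i in hyphenation_indices:
            # Ordinary point: nothing is consumed, a hyphen is added.
            points.append(BreakPoint(i, "", "-", ""))
    return points


def render(word, point, broken):
    """Return the output fragments for `word` with `point` taken or not."""
    head = word[:point.index]
    tail = word[point.index + len(point.no_break):]
    if broken:
        return [head + point.pre_break, point.post_break + tail]
    return [head + point.no_break + tail]


word = "İsa'nın"
point = break_points(word, set())[0]
print(render(word, point, broken=False), render(word, point, broken=True))
```

The point of the model is that the tokenizer can keep the apostrophe as part of the word, while the decision about which characters reach the page is deferred until the line breaker has chosen its breaks.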

alerque commented 10 months ago

By the way, don't worry about this sample being a veritable festival of hyphenation. Breaks like these do occur naturally in Turkish prose, but the copy and metrics used here were engineered as a torture test that forces hyphenation at awkward places. Having more lines hyphenated than not is expected in this situation.

alerque commented 7 months ago

Apparently the Turkish Language Institute's guidance has treated this differently in different years, as do different publishers' style guides. It isn't clear whether any of those years or publishers chose their rule as an accommodation to what existing tooling could accomplish. This current page, for example, recommends dropping the hyphen instead of the apostrophe in these cases.
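Since the guidance varies, an implementation might want the behavior to be selectable. The Python sketch below shows the two conventions side by side; the enum `ApostropheBreak` and the function `split_at_apostrophe` are hypothetical names for this illustration, not an existing Typst or SILE option.

```python
from enum import Enum


class ApostropheBreak(Enum):
    DROP_APOSTROPHE = "drop apostrophe"  # line ends "İstanbul-"
    DROP_HYPHEN = "drop hyphen"          # line ends "İstanbul'"


def split_at_apostrophe(word, style):
    """Break `word` at its first ASCII apostrophe according to `style`."""
    stem, mark, suffix = word.partition("'")
    if not (stem and mark and suffix):
        raise ValueError("expected an apostrophe inside the word")
    if style is ApostropheBreak.DROP_APOSTROPHE:
        # The hyphen replaces the apostrophe.
        return stem + "-", suffix
    # The apostrophe stays and no hyphen is added.
    return stem + mark, suffix


for style in ApostropheBreak:
    print(style.value, split_at_apostrophe("İstanbul'dan", style))
```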