Missing line breaking for CJK in LuaLaTeX

reutenauer / polyglossia

An alternative to Babel for XeLaTeX and LuaLaTeX

http://www.ctan.org/pkg/polyglossia

MIT License


Missing line breaking for CJK in LuaLaTeX #635

Open pcdi opened 2 months ago

pcdi commented 2 months ago

I am not able to get line breaking to work with CJK languages in LuaLaTeX. Is this intended behavior? The polyglossia documentation states that it provides a mechanism for “Loading the appropriate hyphenation patterns” (p. 4). Even though CJK languages do not have hyphenation, I was hoping for some kind of basic line breaking functionality, even if not as sophisticated as what specialized packages such as ctex or luatexja might provide.

In XeTeX, this kind of basic line breaking was possible by including something along the lines of:

\XeTeXlinebreaklocale "zh"
\XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt
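
For reference, a complete document using these two primitives might look like the sketch below. It deliberately uses only fontspec, so it shows the XeTeX mechanism on its own. It assumes the file is compiled with xelatex and that Source Han Sans SC is installed; the grouping and the explicit \par are illustrative choices, not something polyglossia requires.

```latex
% Compile with xelatex. Assumes Source Han Sans SC is installed;
% any other CJK font known to fontspec should behave the same way.
\documentclass{article}
\usepackage{fontspec}
\newfontfamily\chinesefont{Source Han Sans SC}

\begin{document}
\begingroup
  \chinesefont
  % Let XeTeX find break opportunities using the rules for Chinese
  \XeTeXlinebreaklocale "zh"
  % Glue inserted at each break opportunity; the stretch helps justification
  \XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt
  傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁翦%
  翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
  \par % end the paragraph while the settings are still in effect
\endgroup
\end{document}
```

Because both settings are made inside the group, they do not leak into the surrounding non-CJK text.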

However, this is obviously not possible in LuaTeX. Is there any way to get CJK line breaking with polyglossia only?

The following example illustrates the problem.[^1]

[^1]: The test string consists of the 68 URO code points that have five unique glyphs in Source Han Sans/Serif; see the font readme under “Glyph Sharing Statistics”, p. 16.

Version info:

LuaHBTeX, Version 1.18.0 (TeX Live 2024)
LaTeX2e <2023-11-01> patch level 1
L3 programming layer <2024-03-14>
Polyglossia 2.1, rev. 70491 (TeX Live)
Fontspec 2.9a, rev. 69833 (TeX Live)

\documentclass{article}

\usepackage{polyglossia}

\newcommand*{\teststring}{%
    傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁翦%
    翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
}

\setdefaultlanguage{english}

\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified]{Source Han Sans SC}

\setotherlanguage{japanese}
\newfontfamily\japanesefont[Script=CJK,Language=Japanese]{Source Han Sans}

\begin{document}

\textchinese{\teststring}

\begin{chinese}
    \teststring\par
\end{chinese}

\textjapanese{\teststring}

\begin{japanese}
    \teststring\par
\end{japanese}

\end{document}

Output image

![Polyglossia-CJK](https://github.com/reutenauer/polyglossia/assets/4654151/94939c74-ce67-432b-bc68-1be043116061)

Line breaking does work if luatexja-fontspec is loaded; however, this also has many unintended side effects. For example,

U+2018 LEFT SINGLE QUOTATION MARK
U+2019 RIGHT SINGLE QUOTATION MARK
U+201C LEFT DOUBLE QUOTATION MARK
U+201D RIGHT DOUBLE QUOTATION MARK
U+201E DOUBLE LOW-9 QUOTATION MARK

and so on are interpreted as CJK punctuation, so the font is switched to a CJK font regardless of whether the text is inside a CJK language environment defined through polyglossia. This means that simply using luatexja-fontspec alongside polyglossia is not straightforward. luatexja-fontspec also redefines fonts by itself, which runs counter to the intention of using polyglossia/fontspec to define font-language pairs in the first place.

\documentclass{article}

\usepackage[]{luatexja-fontspec}
% \usepackage[sourcehan]{luatexja-preset}

\usepackage{polyglossia}

\newcommand*{\teststring}{%
    傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁翦%
    翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
}

\setdefaultlanguage{english}

\setotherlanguage{chinese}
\newjfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified]{Source Han Sans}

\setotherlanguage{japanese}
\newjfontfamily\japanesefont[Script=CJK,Language=Japanese]{Source Han Sans}

\begin{document}

\textchinese{\teststring}

\begin{chinese}
    \teststring\par
\end{chinese}

\textjapanese{\teststring}

\begin{japanese}
    \teststring\par
\end{japanese}

\end{document}

Output image

![Luatex-ja-CJK](https://github.com/reutenauer/polyglossia/assets/4654151/1f85a056-f4d9-4fd8-8b76-c8e3f52294de)

pcdi commented 2 months ago

After checking the commit that introduced support (acfa1744a80bebc536c4cf89a9636164ca84d419), it seems that the current support is indeed dependent on XeTeX:

https://github.com/reutenauer/polyglossia/blob/ce7c1968b870210afa563137f98a78d082086f4f/doc/example-chinese.tex#L15-L16

https://github.com/reutenauer/polyglossia/blob/ce7c1968b870210afa563137f98a78d082086f4f/doc/example-japanese.tex#L9-L10

jspitz commented 1 month ago

@t-tk is there a reason why \XeTeXlinebreaklocale and \XeTeXlinebreakskip are not set in gloss-chinese directly? @dohyunkim is there something we can do with LuaTeX?

dohyunkim commented 1 month ago

polyglossia-korean.lua provides CJK line-breaking functionality. It is intended to be used with Korean texts, but can be abused for Chinese or Japanese. To load this Lua file, we have to declare two attributes in advance:

\makeatletter
\newattribute\xpg@attr@korean \xpg@attr@korean=1
\newattribute\xpg@attr@autojosa 
\makeatother
\directlua{ require"polyglossia-korean" }

The latter, xpg@attr@autojosa, is irrelevant to Chinese or Japanese. The key one is the first, xpg@attr@korean, which accepts 0, 1, or 2 as its value. The value 1 is meant for classical documents, which have no spaces between words, so it should suit Chinese or Japanese texts better than the others. Of course, I can't guarantee that the result will be reasonable, or even acceptable, to the eyes of native Chinese or Japanese readers.
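
Since LuaTeX attribute assignments are local to the current TeX group, the value can also be set for a single stretch of text rather than for the whole document. The following is a minimal sketch under that assumption; \cjkbreaks is a hypothetical helper introduced here for illustration and is not part of polyglossia.

```latex
\makeatletter
\newattribute\xpg@attr@korean    % left unset by default
\newattribute\xpg@attr@autojosa  % required by the Lua file, unused here
\directlua{ require"polyglossia-korean" }

% \cjkbreaks is a hypothetical helper, not a polyglossia command.
% Optional argument: attribute value (0, 1 or 2); default is 1.
\newcommand{\cjkbreaks}[2][1]{%
  \begingroup
    \xpg@attr@korean=#1\relax % applies only to material typeset in this group
    #2%
  \endgroup
}
\makeatother
```

With this in the preamble, \cjkbreaks{...} typesets its argument with the attribute set to 1, and \cjkbreaks[2]{...} with the value 2, while text outside the helper is left untouched.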

pcdi commented 1 month ago

Based on a quick comparison of WP: Line breaking rules in East Asian languages with polyglossia-korean.lua, it seems that the basic line breaking rules are reasonably similar if not identical between Korean and Chinese/Japanese.

Amending the above example with the code provided by @dohyunkim yields an acceptable result, as can be seen below. Punctuation is also put in the right position at the beginning or end of a line, as expected.

After checking Ken Lunde's CJKV Information Processing, “Line Breaking and Word Wrapping,” pp. 352–355, it seems there are more sophisticated ways to achieve line breaking in CJK (such as hanging punctuation or keeping the text fully aligned to the character grid), but I am unsure if this is something polyglossia should be doing. I would be interested in your opinion on that. The first edition of CJKV Information Processing is also available as a free loan on the Internet Archive (you need to register for a free account, though). See for example p. 345 for an explanation of the character grid.

The fonts are rescaled in the following example to match the output by luatex-ja above, as luatex-ja scales the fonts to 0.962216 by default (see sections 2.3, 2.4 of the luatex-ja manual).

\documentclass{article}

\usepackage{polyglossia}

\makeatletter
\newattribute\xpg@attr@korean \xpg@attr@korean=1
\newattribute\xpg@attr@autojosa 
\makeatother
\directlua{ require"polyglossia-korean" }

\newcommand*{\teststring}{%
    傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁翦%
    翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
}

\newcommand*{\teststringpunct}{%
    傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁。（翦）。%
    翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
}

\setdefaultlanguage{english}

\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified,Scale=0.962216]{Source Han Sans SC}

\setotherlanguage{japanese}
\newfontfamily\japanesefont[Script=CJK,Language=Japanese,Scale=0.962216]{Source Han Sans}

\begin{document}

\textchinese{\teststring}

\begin{chinese}
    \teststring\par
\end{chinese}

\textjapanese{\teststring}

\begin{japanese}
    \teststring\par
\end{japanese}

\textchinese{\teststringpunct}

\begin{chinese}
    \teststringpunct\par
\end{chinese}

\textjapanese{\teststringpunct}

\begin{japanese}
    \teststringpunct\par
\end{japanese}

\end{document}

Output image

![Linebreaking-Punct](https://github.com/reutenauer/polyglossia/assets/4654151/ec3ab7d9-f712-45ac-acee-06269b13bdaf)

pcdi commented 1 month ago

On further testing, I noticed that polyglossia-korean.lua inserts spaces in unwanted positions, specifically outside of text that is tagged as CJK. For example, it inserts a space between a (half-width) parenthesis and a quotation mark if they are directly adjacent (with or without csquotes). You can also test with \xpg@attr@korean=0, which solves the problem by not inserting any spaces at all. (The spaces in the output image below are due to font rendering; no U+0020 SPACE is inserted. You can cross-check with UniView.)

\documentclass{article}

\usepackage{polyglossia}
\usepackage{csquotes}

\makeatletter
% \newattribute\xpg@attr@korean \xpg@attr@korean=0
\newattribute\xpg@attr@korean \xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\makeatother
\directlua{ require"polyglossia-korean" }

\setdefaultlanguage{english}

\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified]{Source Han Sans SC}

\begin{document}
(\enquote*{test}) (‘test’) ('test')

(\enquote{test}) (“test”) ("test")

% Half-width parentheses
\textchinese{傑(\enquote*{test})僭(‘test’)割('test')劘}

\textchinese{傑(\enquote{test})僭(“test”)割("test")劘}

% Full-width parentheses
\textchinese{傑（\enquote*{test}）僭（‘test’）割（'test'）劘}

\textchinese{傑（\enquote{test}）僭（“test”）割（"test"）劘}

\end{document}

With \xpg@attr@korean=1:

korean-1

With \xpg@attr@korean=0:

korean-0

jspitz commented 1 month ago

I don't have the expertise to comment on what should and what should not be done for Chinese and/or Japanese. In any case, once the open questions are sorted out and things are well tested, a pull request would be highly welcome, @pcdi!

jspitz commented 1 month ago

(and of course we can help, if needed, implementation-wise!)

dohyunkim commented 1 month ago

On further testing, I noticed that polyglossia-korean.lua inserts spaces in unwanted positions, specifically outside of text that is tagged as CJK.

I think that inserting small spaces between CJK and non-CJK characters is not a bug, but a feature. And this example shows why packages like polyglossia are needed for multi-language documents. Instead of setting attributes globally, we can turn it on for CJK text only.

\makeatletter
\newattribute\xpg@attr@korean %\xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\directlua{ require"polyglossia-korean" }
\AddToHook{cmd/textchinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{cmd/textchinese/after}{\unsetattribute\xpg@attr@korean}
\makeatother

pcdi commented 1 month ago

You are absolutely right, inserting spaces between CJK and non-CJK characters is indeed an important feature:

If the document is primarily composed of CJKV glyphs with some Latin glyphs sprinkled throughout, then the convention or principle is to use extra space to separate these two classes of glyphs. Conversely, if the document is primarily composed of Latin glyphs with CJKV glyphs sprinkled around, […] then conventional Latin spaces suffice, given the extent to which spaces are important in Western typography. (CJKV Information Processing, 2nd ed., p. 518)

Incidentally, the amount of this spacing seems to differ: For example, thin spaces, quarter-width spaces, or some house rules may be applicable.

Thanks a lot for your solution! This resolves the problem of inserting spaces in non-CJK contexts. Naturally, we also need to add

\AddToHook{env/chinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{env/chinese/after}{\unsetattribute\xpg@attr@korean}

for the respective environments. However, nested function calls are still a problem. Consider this example:

\documentclass{article}

\usepackage{polyglossia}
\usepackage{csquotes}

\makeatletter
\newattribute\xpg@attr@korean %\xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\directlua{ require"polyglossia-korean" }
\AddToHook{cmd/textchinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{cmd/textchinese/after}{\unsetattribute\xpg@attr@korean}
\makeatother

\setdefaultlanguage{english}

\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified]{Source Han Sans SC}

\begin{document}

\begin{enumerate}
    \item (\enquote*{test}) (‘test’)

          % Half-width parentheses
    \item \textchinese{傑(\enquote*{test})僭(‘test’)割}

    \item \textchinese{傑(\textenglish{\enquote*{test}})僭(\textenglish{‘test’})割}

    \item \textchinese{傑\textenglish{(\enquote*{test})}僭\textenglish{(‘test’)}割}

    \item \textchinese{傑(\textenglish{test})僭(\textenglish{test})割}

          % Full-width parentheses
    \item \textchinese{傑（\enquote*{test}）僭（‘test’）割}

    \item \textchinese{傑（\textenglish{\enquote*{test}}）僭（\textenglish{‘test’}）割}

    \item \textchinese{傑（\textenglish{test}）僭（\textenglish{test}）割}
\end{enumerate}

\end{document}

Output image

![nested-env](https://github.com/reutenauer/polyglossia/assets/4654151/1432587e-6756-4d2b-aeba-ea4106a710d1)

I found that whether an actual U+0020 SPACE character is inserted seems to depend on the font being used; the optical spacing, however, is mostly identical across fonts.

@dohyunkim Do you think this spacing issue is something that can be dealt with by polyglossia?

jspitz commented 1 month ago

Thanks a lot for your solution! This resolves the problem of inserting spaces in non-CJK contexts. Naturally, we also need to add
\AddToHook{env/chinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{env/chinese/after}{\unsetattribute\xpg@attr@korean}

This should be done in the gloss, e.g. (untested)

\ifxetex
  \let\xpg@orig@XeTeXlinebreakskip\XeTeXlinebreakskip%
  \let\xpg@orig@XeTeXlinebreaklocale\XeTeXlinebreaklocale%
\fi

\def\chinese@spacing{%
  \ifluatex
     \xpg@attr@korean=1\relax%
  \else
      \XeTeXlinebreaklocale "zh"%
      \XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt%
  \fi
}

\def\nochinese@spacing{%
  \ifluatex
      \unsetattribute\xpg@attr@korean%
  \else
      \let\XeTeXlinebreakskip\xpg@orig@XeTeXlinebreakskip%
      \let\XeTeXlinebreaklocale\xpg@orig@XeTeXlinebreaklocale%
  \fi
}

[...]

\def\noextras@chinese{%
    \chinese@capsformat%
    \nochinese@spacing%
}

\def\blockextras@chinese{%
    \chinese@capsformat%
    \chinese@spacing%
}

\def\inlineextras@chinese{%
    \chinese@capsformat%
    \chinese@spacing%
}

t-tk commented 1 month ago

@t-tk is there a reason why \XeTeXlinebreaklocale and \XeTeXlinebreakskip are not set in gloss-chinese directly?

There is no deep reason; I simply did not know how to handle it for both XeLaTeX and LuaLaTeX. Setting \XeTeXlinebreak... within \ifxetex ... \fi in gloss-chinese seems fine.

Udi-Fogiel commented 1 month ago

This should be done in the gloss, e.g. (untested)


\ifxetex
  \let\xpg@orig@XeTeXlinebreakskip\XeTeXlinebreakskip%
  \let\xpg@orig@XeTeXlinebreaklocale\XeTeXlinebreaklocale%
\fi

This does not save the values of \XeTeXlinebreakskip and \XeTeXlinebreaklocale, but the primitives themselves. For \XeTeXlinebreakskip it should be \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip, and for \XeTeXlinebreaklocale there is no way that I'm aware of to get the current value (we would have to do the bookkeeping ourselves, and it would theoretically fail if a user set \XeTeXlinebreaklocale manually).
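
The difference between aliasing a parameter with \let and copying its value into a register can be seen in a small engine-independent sketch. Here \parskip merely stands in for \XeTeXlinebreakskip so that the file compiles with any engine; \aliasskip and \savedskip are hypothetical names chosen for illustration.

```latex
\documentclass{article}
\newskip\savedskip            % hypothetical register that holds a copy of the value
\begin{document}
\parskip=5pt
\let\aliasskip\parskip        % \aliasskip is now another name for the parameter itself
\savedskip=\parskip           % \savedskip stores the current value (5pt)
\parskip=10pt
alias: \the\aliasskip         % typesets 10.0pt, because the alias follows the parameter
\quad
saved: \the\savedskip         % typesets 5.0pt, the value at the time of the copy
\end{document}
```

Only the register copy preserves the earlier value, which is why restoring from a \let alias would be a no-op.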

Maybe it would be easier if we simply returned to the default \XeTeXlinebreaklocale in \noextras@<lang>; the code would be cleaner, and I don't see a downside to that.

\def\noextras@chinese{%
    \chinese@capsformat%
    \nochinese@spacing%
}

That wouldn't work. \noextras@chinese is used only by \selectlanguage, so in nested cases of \textchinese with other languages, e.g. in \textchinese{\textenglish{...}} the attribute will stay 1 in the english text. Is there a reason why \noextras<lang> is not used in \text<lang> or the environment equivalent?

jspitz commented 1 month ago

This does not save the values of \XeTeXlinebreakskip and \XeTeXlinebreaklocale, but the primitives themselves. For \XeTeXlinebreakskip it should be \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip, and for \XeTeXlinebreaklocale there is no way that I'm aware of to get the current value (we would have to do the bookkeeping ourselves, and it would theoretically fail if a user set \XeTeXlinebreaklocale manually).

Thanks!

Maybe it would be easier if we simply returned to the default \XeTeXlinebreaklocale in \noextras@<lang>; the code would be cleaner, and I don't see a downside to that.

That would probably cause problems if someone has changed the default value outside the Chinese context.

\def\noextras@chinese{%
    \chinese@capsformat%
    \nochinese@spacing%
}

That wouldn't work. \noextras@chinese is used only by \selectlanguage, so in nested cases of \textchinese with other languages, e.g. in \textchinese{\textenglish{...}} the attribute will stay 1 in the english text. Is there a reason why \noextras<lang> is not used in \text<lang> or the environment equivalent?

If that is the case, it would need to be fixed anyway, as this is the usual way we locally set and unset things. But I thought we call \noextras<lang> when entering a nested language (no time to check now).

dohyunkim commented 1 month ago

@pcdi Please do not use ASCII punctuation (such as the so-called half-width parentheses) in Chinese or Japanese text. xpg@attr@korean=1 is not intended for such a case.

Only in Korea has it been general practice, since the late 20th century, to mix ASCII punctuation (including the inter-word SPACE) with CJK characters. For these quite modern Korean texts, xpg@attr@korean should be 0 or 2. The latter (2) seems to be more suitable for your example.

jspitz commented 1 month ago

This does not save the values of \XeTeXlinebreakskip and \XeTeXlinebreaklocale, but the primitives themselves. For \XeTeXlinebreakskip it should be \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip, and for \XeTeXlinebreaklocale there is no way that I'm aware of to get the current value (we would have to do the bookkeeping ourselves, and it would theoretically fail if a user set \XeTeXlinebreaklocale manually).

If I understand correctly, we could just set \XeTeXlinebreaklocale "" to undo the change. So the proposed change would be:

\ifluatex
 \directlua{ require"polyglossia-korean" }% rename to polyglossia-CJK-spacing?
\else
  \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip%
\fi

\def\chinese@spacing{%
  \ifluatex
     \xpg@attr@korean=1\relax%
  \else
      \XeTeXlinebreaklocale "zh"%
      \XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt%
  \fi
}

\def\nochinese@spacing{%
  \ifluatex
      \unsetattribute\xpg@attr@korean%
  \else
      \XeTeXlinebreakskip=\xpg@orig@XeTeXlinebreakskip%
      \XeTeXlinebreaklocale ""%
  \fi
}

[...]

\def\noextras@chinese{%
    \chinese@capsformat%
    \nochinese@spacing%
}

\def\blockextras@chinese{%
    \chinese@capsformat%
    \chinese@spacing%
}

\def\inlineextras@chinese{%
    \chinese@capsformat%
    \chinese@spacing%
}

@pcdi could you test if that works?

Udi-Fogiel commented 1 month ago

If I understand correctly, we could just set \XeTeXlinebreaklocale "" to undo the change.

Yes, this is what I meant by

Maybe it would be easier if we simply returned to the default \XeTeXlinebreaklocale in \noextras@<lang>; the code would be cleaner, and I don't see a downside to that.

Sorry if that wasn't clear (note that it will still be "wrong" if a user sets \XeTeXlinebreaklocale manually).

So the proposed change would be:

There is a need to declare \xpg@orig@XeTeXlinebreakskip as a new skip register, so probably it should be:


\ifluatex
  % Declare the attributes unless they already exist
  \@ifundefined{xpg@attr@korean}{\newattribute\xpg@attr@korean}{}
  \@ifundefined{xpg@attr@autojosa}{\newattribute\xpg@attr@autojosa}{}
  \directlua{ require"polyglossia-korean" } % rename to polyglossia-CJK-spacing?
  \def\chinese@spacing{\xpg@attr@korean=\@ne}
  \def\nochinese@spacing{\unsetattribute\xpg@attr@korean}
\else
  % Save the current value of the skip (not the primitive) in a register
  \@ifundefined{xpg@orig@XeTeXlinebreakskip}{\newskip\xpg@orig@XeTeXlinebreakskip}{}
  \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip
  \def\chinese@spacing{%
    \XeTeXlinebreaklocale "zh"
    \XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt}
  \def\nochinese@spacing{%
    \XeTeXlinebreakskip=\xpg@orig@XeTeXlinebreakskip
    \XeTeXlinebreaklocale ""}
\fi

[...]

\def\noextras@chinese{%
    \chinese@capsformat
    \nochinese@spacing
}

\def\blockextras@chinese{%
    \chinese@capsformat
    \chinese@spacing
}

\def\inlineextras@chinese{%
    \chinese@capsformat
    \chinese@spacing
}