Open pcdi opened 2 months ago
After checking the commit that introduced support (acfa1744a80bebc536c4cf89a9636164ca84d419), it seems that the current support is indeed dependent on XeTeX.
@t-tk is there a reason why \XeTeXlinebreaklocale and \XeTeXlinebreakskip are not set in gloss-chinese directly?
@dohyunkim is there something we can do with LuaTeX?
polyglossia-korean.lua provides CJK line-breaking functionality. It is intended to be used with Korean texts, but can be abused for Chinese or Japanese.
To load this lua file, we have to declare two attributes in advance:
\makeatletter
\newattribute\xpg@attr@korean \xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\makeatother
\directlua{ require"polyglossia-korean" }
The latter one, xpg@attr@autojosa, is irrelevant to Chinese or Japanese.
The key is the first one, xpg@attr@korean, and we can give 0, 1, or 2 as its value.
The value 1 is for classical documents which have no spaces between words, so it should be more suitable than the others for Chinese or Japanese texts.
Of course I can't guarantee that the result will be reasonable, or at least acceptable, to the eyes of native Chinese or Japanese readers.
Based on a quick comparison of WP: Line breaking rules in East Asian languages with polyglossia-korean.lua, it seems that the basic line breaking rules are reasonably similar if not identical between Korean and Chinese/Japanese.
Amending the above example with the code provided by @dohyunkim yields an acceptable result, as can be seen below. Punctuation is also put in the right position at the beginning or end of a line, as expected.
After checking Ken Lunde's CJKV Information Processing, “Line Breaking and Word Wrapping,” p. 352–355, it seems there are more sophisticated ways to achieve line breaking in CJK (such as hanging punctuation or keeping the text fully aligned to the character grid), but I am unsure if this is something polyglossia should be doing. I would be interested in your opinion on that. The first edition of CJKV Information Processing is also available as a free loan on the Internet Archive here (need to register for a free account, though). See for example p. 345 for an explanation of the character grid.
The fonts are rescaled in the following example to match the output by luatex-ja above, as luatex-ja scales the fonts to 0.962216 by default (see sections 2.3, 2.4 of the luatex-ja manual).
\documentclass{article}
\usepackage{polyglossia}
\makeatletter
\newattribute\xpg@attr@korean \xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\makeatother
\directlua{ require"polyglossia-korean" }
\newcommand*{\teststring}{%
傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁翦%
翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
}
\newcommand*{\teststringpunct}{%
傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁。(翦)。%
翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟%
}
\setdefaultlanguage{english}
\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified,Scale=0.962216]{Source Han Sans SC}
\setotherlanguage{japanese}
\newfontfamily\japanesefont[Script=CJK,Language=Japanese,Scale=0.962216]{Source Han Sans}
\begin{document}
\textchinese{\teststring}
\begin{chinese}
\teststring\par
\end{chinese}
\textjapanese{\teststring}
\begin{japanese}
\teststring\par
\end{japanese}
\textchinese{\teststringpunct}
\begin{chinese}
\teststringpunct\par
\end{chinese}
\textjapanese{\teststringpunct}
\begin{japanese}
\teststringpunct\par
\end{japanese}
\end{document}
On further testing, I noticed that polyglossia-korean.lua inserts spaces in unwanted positions, specifically outside of text that is tagged as CJK. For example, it inserts a space between a (half-width) parenthesis and quotation marks if they are directly adjacent (with or without csquotes). You can also test with \xpg@attr@korean=0, which solves the problem by not inserting any spaces at all. (The spaces in the output image below are due to the font rendering; no 0020 SPACE is inserted. You can also cross-check with UniView.)
\documentclass{article}
\usepackage{polyglossia}
\usepackage{csquotes}
\makeatletter
% \newattribute\xpg@attr@korean \xpg@attr@korean=0
\newattribute\xpg@attr@korean \xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\makeatother
\directlua{ require"polyglossia-korean" }
\setdefaultlanguage{english}
\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified]{Source Han Sans SC}
\begin{document}
(\enquote*{test}) (‘test’) ('test')
(\enquote{test}) (“test”) ("test")
% Half-width parentheses
\textchinese{傑(\enquote*{test})僭(‘test’)割('test')劘}
\textchinese{傑(\enquote{test})僭(“test”)割("test")劘}
% Full-width parentheses
\textchinese{傑(\enquote*{test})僭(‘test’)割('test')劘}
\textchinese{傑(\enquote{test})僭(“test”)割("test")劘}
\end{document}
With \xpg@attr@korean=1:
With \xpg@attr@korean=0:
I don't have the expertise to comment on what should and what should not be done for Chinese and/or Japanese. In any case, once the open questions are sorted out and things are well tested, a pull request would be highly welcome, @pcdi!
(and of course we can help, if needed, implementation-wise!)
> On further testing, I noticed that polyglossia-korean.lua inserts spaces in unwanted positions, specifically outside of text that is tagged as CJK.

I think that inserting small spaces between CJK and non-CJK characters is not a bug, but a feature. And this example shows why packages like polyglossia are needed for multi-language documents. Instead of setting attributes globally, we can turn it on for CJK text only.
\makeatletter
\newattribute\xpg@attr@korean %\xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\directlua{ require"polyglossia-korean" }
\AddToHook{cmd/textchinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{cmd/textchinese/after}{\unsetattribute\xpg@attr@korean}
\makeatother
You are absolutely right, inserting spaces between CJK and non-CJK characters is indeed an important feature:
If the document is primarily composed of CJKV glyphs with some Latin glyphs sprinkled throughout, then the convention or principle is to use extra space to separate these two classes of glyphs. Conversely, if the document is primarily composed of Latin glyphs with CJKV glyphs sprinkled around, […] then conventional Latin spaces suffice, given the extent to which spaces are important in Western typography. (CJKV Information Processing, 2nd ed., p. 518)
Incidentally, the amount of this spacing seems to differ: For example, thin spaces, quarter-width spaces, or some house rules may be applicable.
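To make the difference between these amounts concrete, here is a minimal sketch that sets the CJK/Latin gap by hand, once as a thin space and once as a quarter em. The macro names \thincjkgap and \quartercjkgap and the font switch \cjkdemofont are hypothetical helpers for illustration only; polyglossia-korean.lua computes its own spacing, and which width is appropriate depends on the house rules in force.

```latex
% Compile with XeLaTeX or LuaLaTeX.
\documentclass{article}
\usepackage{fontspec}
\newfontfamily\cjkdemofont{Source Han Sans SC}% font taken from the examples in this thread
% Hypothetical helper macros, not part of polyglossia:
\newcommand*{\thincjkgap}{\kern 0.16667em\relax}% same width as \thinspace
\newcommand*{\quartercjkgap}{\kern 0.25em\relax}% a quarter of an em
\begin{document}
{\cjkdemofont 傑}\thincjkgap test\thincjkgap{\cjkdemofont 僭}\par
{\cjkdemofont 傑}\quartercjkgap test\quartercjkgap{\cjkdemofont 僭}\par
\end{document}
```

Both gaps are kerns, so they do not offer a line break opportunity; the em is measured in the font that is current where the macro is expanded.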
Thanks a lot for your solution! This resolves the problem of inserting spaces in non-CJK contexts. Naturally, we also need to add
\AddToHook{env/chinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{env/chinese/after}{\unsetattribute\xpg@attr@korean}
for the respective environments. However, nested function calls are still a problem. Consider this example:
\documentclass{article}
\usepackage{polyglossia}
\usepackage{csquotes}
\makeatletter
\newattribute\xpg@attr@korean %\xpg@attr@korean=1
\newattribute\xpg@attr@autojosa
\directlua{ require"polyglossia-korean" }
\AddToHook{cmd/textchinese/before}{\xpg@attr@korean=1\relax}
\AddToHook{cmd/textchinese/after}{\unsetattribute\xpg@attr@korean}
\makeatother
\setdefaultlanguage{english}
\setotherlanguage{chinese}
\newfontfamily\chinesefont[Script=CJK,Language=Chinese Simplified]{Source Han Sans SC}
\begin{document}
\begin{enumerate}
\item (\enquote*{test}) (‘test’)
% Half-width parentheses
\item \textchinese{傑(\enquote*{test})僭(‘test’)割}
\item \textchinese{傑(\textenglish{\enquote*{test}})僭(\textenglish{‘test’})割}
\item \textchinese{傑\textenglish{(\enquote*{test})}僭\textenglish{(‘test’)}割}
\item \textchinese{傑(\textenglish{test})僭(\textenglish{test})割}
% Full-width parentheses
\item \textchinese{傑(\enquote*{test})僭(‘test’)割}
\item \textchinese{傑(\textenglish{\enquote*{test}})僭(\textenglish{‘test’})割}
\item \textchinese{傑(\textenglish{test})僭(\textenglish{test})割}
\end{enumerate}
\end{document}
I found that whether or not an actual 0020 SPACE character is inserted seems to depend on the font being used; however, the optical spacing is mostly identical across fonts.
@dohyunkim Do you think this spacing issue is something that can be dealt with by polyglossia?
> Thanks a lot for your solution! This resolves the problem of inserting spaces in non-CJK contexts. Naturally, we also need to add
>
> \AddToHook{env/chinese/before}{\xpg@attr@korean=1\relax}
> \AddToHook{env/chinese/after}{\unsetattribute\xpg@attr@korean}

This should be done in the gloss, e.g. (untested)
\ifxetex
\let\xpg@orig@XeTeXlinebreakskip\XeTeXlinebreakskip%
\let\xpg@orig@XeTeXlinebreaklocale\XeTeXlinebreaklocale%
\fi
\def\chinese@spacing{%
\ifluatex
\xpg@attr@korean=1\relax%
\else
\XeTeXlinebreaklocale "zh"%
\XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt%
\fi
}
\def\nochinese@spacing{%
\ifluatex
\unsetattribute\xpg@attr@korean%
\else
\let\XeTeXlinebreakskip\xpg@orig@XeTeXlinebreakskip%
\let\XeTeXlinebreaklocale\xpg@orig@XeTeXlinebreaklocale%
\fi
}
[...]
\def\noextras@chinese{%
\chinese@capsformat%
\nochinese@spacing%
}
\def\blockextras@chinese{%
\chinese@capsformat%
\chinese@spacing%
}
\def\inlineextras@chinese{%
\chinese@capsformat%
\chinese@spacing%
}
> @t-tk is there a reason why \XeTeXlinebreaklocale and \XeTeXlinebreakskip are not set in gloss-chinese directly?

There is no deep reason. I simply did not know how to handle it for both XeLaTeX and LuaLaTeX.
Setting \XeTeXlinebreak... within \ifxetex ... \fi in gloss-chinese seems fine.
> This should be done in the gloss, e.g. (untested)
>
> \ifxetex
> \let\xpg@orig@XeTeXlinebreakskip\XeTeXlinebreakskip%
> \let\xpg@orig@XeTeXlinebreaklocale\XeTeXlinebreaklocale%
> \fi

This does not save the values of \XeTeXlinebreakskip and \XeTeXlinebreaklocale, but the primitives themselves. For \XeTeXlinebreakskip it should be \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip, and for \XeTeXlinebreaklocale there is no way that I'm aware of to get the current value (we would have to do the bookkeeping ourselves, and it would theoretically fail if a user set \XeTeXlinebreaklocale manually).
Maybe it would be easier if we simply returned to the default \XeTeXlinebreaklocale in \noextras@<lang>: the code would be cleaner, and I don't see a downside to that.
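The difference between aliasing a parameter with \let and copying its current value into a register can be seen in a small engine-independent sketch. It uses \parskip as a stand-in for \XeTeXlinebreakskip so that it compiles under any engine; the names \demoalias and \demosavedskip are hypothetical.

```latex
\documentclass{article}
\newskip\demosavedskip            % hypothetical register that holds the saved value
\begin{document}
\parskip=3pt plus 1pt\relax
\let\demoalias\parskip            % alias of the parameter itself, no value is stored
\demosavedskip=\parskip           % copies the current value (3pt plus 1pt)
\parskip=10pt\relax               % later change, e.g. by a language switch
alias: \the\demoalias\par         % follows \parskip, prints 10.0pt
saved: \the\demosavedskip\par     % still the old value, prints 3.0pt plus 1.0pt
\parskip=\demosavedskip           % restore the saved value
\end{document}
```

The alias always reports whatever the parameter currently holds, which is why it cannot be used to restore an earlier setting.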
> \def\noextras@chinese{%
> \chinese@capsformat%
> \nochinese@spacing%
> }

That wouldn't work. \noextras@chinese is used only by \selectlanguage, so in nested cases of \textchinese with other languages, e.g. in \textchinese{\textenglish{...}}, the attribute will stay 1 in the English text. Is there a reason why \noextras<lang> is not used in \text<lang> or the environment equivalent?
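The nesting behaviour described here follows from TeX's grouping rules: an attribute set in an outer group is inherited by every inner group unless it is explicitly unset there. A minimal sketch of this, using a hypothetical attribute \demoattr as a stand-in for \xpg@attr@korean and plain groups as stand-ins for \textchinese and \textenglish:

```latex
% Compile with LuaLaTeX (attributes are a LuaTeX feature).
\documentclass{article}
\newattribute\demoattr            % hypothetical stand-in for \xpg@attr@korean
\begin{document}
{\demoattr=1\relax                % outer group, plays the role of \textchinese
 outer: \the\demoattr\par         % prints 1
 {inherited: \the\demoattr\par}   % a plain inner group still sees 1
 {\unsetattribute\demoattr        % inner group, plays the role of \textenglish
  inner: \the\demoattr\par}       % prints the "unset" value
 outer again: \the\demoattr\par}  % back to 1 once the inner group ends
\end{document}
```

Because the unsetting is local to the inner group, doing it whenever a nested language is entered would leave the surrounding Chinese text untouched. Whether polyglossia should do this through \noextras<lang> or another mechanism is the open question above.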
> It does not save the values of \XeTeXlinebreakskip, \XeTeXlinebreaklocale, but the primitives themselves. For \XeTeXlinebreakskip it should be \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip, and for \XeTeXlinebreaklocale there is no way that I'm aware of to get the current value (we will have to do the bookkeeping ourselves, and it would theoretically fail if a user set \XeTeXlinebreaklocale manually).

Thanks!

> Maybe it would be easier if we simply return to the default \XeTeXlinebreaklocale in \noextras@<lang>, the code would be cleaner, and I don't see a downside to that.

That would probably cause problems if someone has changed the default value outside the Chinese context.
> \def\noextras@chinese{%
> \chinese@capsformat%
> \nochinese@spacing%
> }
>
> That wouldn't work. \noextras@chinese is used only by \selectlanguage, so in nested cases of \textchinese with other languages, e.g. in \textchinese{\textenglish{...}}, the attribute will stay 1 in the English text. Is there a reason why \noextras<lang> is not used in \text<lang> or the environment equivalent?

If that is the case, it would need to be fixed anyway, as this is the usual way we locally set and unset things. But I thought we call \noextras<lang> when entering a nested language (no time to check now).
@pcdi Please do not use ASCII punctuation (such as the so-called half-width parentheses) in a Chinese or Japanese text. xpg@attr@korean=1 is not intended for such a case.
Only in Korea has it been the general practice, since the late 20th century, to mix ASCII punctuation (including the inter-word SPACE) with CJK characters. For these quite modern Korean texts, xpg@attr@korean should be 0 or 2. The latter (2) seems to be more suitable for your example.
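Applied to the hook-based setup shown earlier in this thread, that suggestion would amount to changing only the attribute value. This is an untested sketch for the preamble of the nested example above; only the value 2 differs from the code already posted.

```latex
\makeatletter
\newattribute\xpg@attr@korean
\newattribute\xpg@attr@autojosa
\directlua{ require"polyglossia-korean" }
% Value 2 instead of 1, as suggested for text that mixes ASCII punctuation with CJK
\AddToHook{cmd/textchinese/before}{\xpg@attr@korean=2\relax}
\AddToHook{cmd/textchinese/after}{\unsetattribute\xpg@attr@korean}
\makeatother
```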
> It does not save the values of \XeTeXlinebreakskip, \XeTeXlinebreaklocale, but the primitives themselves. For \XeTeXlinebreakskip it should be \xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip, and for \XeTeXlinebreaklocale there is no way that I'm aware of to get the current value (we will have to do the bookkeeping ourselves, and it would theoretically fail if a user set \XeTeXlinebreaklocale manually).

If I understand correctly, we could just set \XeTeXlinebreaklocale "" to undo the change. So the proposed change would be:
\ifluatex
\directlua{ require"polyglossia-korean" }% rename to polyglossia-CJK-spacing?
\else
\xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip%
\fi
\def\chinese@spacing{%
\ifluatex
\xpg@attr@korean=1\relax%
\else
\XeTeXlinebreaklocale "zh"%
\XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt%
\fi
}
\def\nochinese@spacing{%
\ifluatex
\unsetattribute\xpg@attr@korean%
\else
\XeTeXlinebreakskip=\xpg@orig@XeTeXlinebreakskip%
\XeTeXlinebreaklocale ""%
\fi
}
[...]
\def\noextras@chinese{%
\chinese@capsformat%
\nochinese@spacing%
}
\def\blockextras@chinese{%
\chinese@capsformat%
\chinese@spacing%
}
\def\inlineextras@chinese{%
\chinese@capsformat%
\chinese@spacing%
}
@pcdi could you test if that works?
> If I understand correctly, we could just set \XeTeXlinebreaklocale "" to undo the change.

Yes, this is what I meant by

> Maybe it would be easier if we simply return to the default \XeTeXlinebreaklocale in \noextras@<lang>, the code would be cleaner, and I don't see a downside to that.

Sorry if that wasn't clear (note that it will still be "wrong" if a user sets \XeTeXlinebreaklocale manually).

> So the proposed change would be:

\xpg@orig@XeTeXlinebreakskip needs to be declared as a new skip register, so it should probably be:
\ifluatex
\@ifundefined{xpg@attr@korean}{\newattribute\xpg@attr@korean}{}
\@ifundefined{xpg@attr@autojosa}{\newattribute\xpg@attr@autojosa}{}
\directlua{ require"polyglossia-korean" } % rename to polyglossia-CJK-spacing?
\def\chinese@spacing{\xpg@attr@korean=\@ne}
\def\nochinese@spacing{\unsetattribute\xpg@attr@korean}
\else
\@ifundefined{xpg@orig@XeTeXlinebreakskip}{\newskip\xpg@orig@XeTeXlinebreakskip}{}
\xpg@orig@XeTeXlinebreakskip=\XeTeXlinebreakskip
\def\chinese@spacing{%
\XeTeXlinebreaklocale "zh"
\XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt}
\def\nochinese@spacing{%
\XeTeXlinebreakskip=\xpg@orig@XeTeXlinebreakskip
\XeTeXlinebreaklocale ""}
\fi
[...]
\def\noextras@chinese{%
\chinese@capsformat
\nochinese@spacing
}
\def\blockextras@chinese{%
\chinese@capsformat
\chinese@spacing
}
\def\inlineextras@chinese{%
\chinese@capsformat
\chinese@spacing
}
I am not able to get line breaking to work with CJK languages with LuaLaTeX. Is this intended behavior? The polyglossia documentation states that it provides a mechanism for “Loading the appropriate hyphenation patterns” (p. 4). Even though CJK languages do not have hyphenation, I was hoping for some kind of basic line breaking functionality, even if not as sophisticated as what specialized packages such as ctex or luatexja might be able to provide.
In XeTeX, this kind of basic line breaking was possible by including something along the lines of:
However, this is obviously not possible in LuaTeX. Is there any way to get CJK line breaking with polyglossia only?
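For reference, the XeTeX-only settings discussed elsewhere in this thread can be tried in a standalone document like the sketch below. The locale "zh" and the skip value are the ones quoted in the proposed gloss code; that they match the snippet originally intended here is an assumption, and \cjkdemofont is a hypothetical font switch.

```latex
% Compile with XeLaTeX only; these primitives do not exist in LuaTeX.
\documentclass{article}
\usepackage{fontspec}
\newfontfamily\cjkdemofont{Source Han Sans SC}% font taken from the examples in this thread
\begin{document}
\XeTeXlinebreaklocale "zh"                      % enable locale-based line breaking
\XeTeXlinebreakskip = 0pt plus 1pt minus 0.1pt  % stretchable glue at break points
{\cjkdemofont 傑僭割劘匾叟喝塌姿嬴幰廋扇扉搨摩榻溲潛瀛瘦瞎磨窖竇箭篠簉糙綢纛羸翁翦%
翩肓臝艘花裯褐謁譖豁贏轄返迷途造週遍遭選遼鄰釁閼雕靠靡颼飯驎鬣魔麗麟\par}
\end{document}
```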
The following example illustrates the problem:[^1]
[^1]: The test string consists of the 68 URO code points that have five unique glyphs in Source Han Sans/Serif, see the font readme under “Glyph Sharing Statistics”, p. 16.
Version info:
Output image
![Polyglossia-CJK](https://github.com/reutenauer/polyglossia/assets/4654151/94939c74-ce67-432b-bc68-1be043116061)
Line breaking does work if luatexja-fontspec is loaded, however this also has many unintended side effects. For example,and so on are interpreted as CJK punctuation and thus the font is switched to a CJK font regardless of whether or not the text is inside a CJK language environment defined through polyglossia, which means simply using luatexja-fontspec alongside polyglossia is not straightforward. Luatexja-fontspec also redefines fonts by itself, which runs counter to the intention of using polyglossia/fontspec to define font-language pairs in the first place.
Output image
![Luatex-ja-CJK](https://github.com/reutenauer/polyglossia/assets/4654151/1f85a056-f4d9-4fd8-8b76-c8e3f52294de)