latex3 / fontspec

Font selection in LaTeX for XeTeX and LuaTeX
http://latex3.github.io/fontspec/
LaTeX Project Public License v1.3c

luatex + harfbuzz and the zero width joiner U+200D #418

Open ralessi opened 4 years ago

ralessi commented 4 years ago

In some cases, namely when commands are inserted between characters, luatex + harfbuzz do not seem to handle the zero width joiner character (U+200D) properly. Consider the following example, to be compiled with lualatex-dev:

\documentclass[12pt]{article}
\usepackage{fontspec}

\newfontfamily\arabicfont{Amiri}[Script=Arabic]
\newfontfamily\arabicfonthb{Amiri}[Script=Arabic,Renderer=Harfbuzz]

\usepackage{ulem}

\begin{document}

\textdir TRT\arabicfont
دَخَلَ مُب‍\uline{‍تَ‍}‍سِمًا

\medskip

\textdir TRT\arabicfonthb
دَخَلَ مُب‍\uline{‍تَ‍}‍سِمًا

\end{document}

[image: test-zwj]

u-fischer commented 4 years ago

I don't get your output with the development version of luaotfload. With it, the output looks like this:

[image]

This is still not correct, but

ralessi commented 4 years ago

Thank you for the references, which I will explore. I suspected that this might be unrelated to fontspec. Do you think it would be worth reporting this issue, which may again be unrelated, to the luaotfload bug tracker?

khaledhosny commented 4 years ago

FWIW, this seems to be a regression in luaotfload. Trying the following with harflatex and the old harf code:

\documentclass[12pt]{minimal}
\usepackage{harfload}
\usepackage{ulem}
\begin{document}

\font\arabicfont="[Amiri-Regular.ttf]:mode=harf"
\textdir TRT\arabicfont
مُب^^^^200d\uline{^^^^200dتَ^^^^200d}^^^^200dسِم

\end{document}

Gives:

zauguin commented 4 years ago

This was a luaotfload bug which is resolved in the latest dev branch.

zauguin commented 4 years ago

The behavior of HarfBuzz seems a bit odd here but I don't know enough about the script to say if it is a bug or expected behaviour:

The luaotfload bug was that in \hboxes the direction wasn't recognized correctly. So the \uline argument was set as TLT instead of TRT.

Now to the odd part: For some reason, HarfBuzz seems to reverse the cluster with the Arabic characters and ignore the preceding ZWJ. This can be reproduced with hb-shape:

hb-shape --direction=rtl --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=arab --unicodes=U+200D,U+062A,U+064E,U+200D

gives

[space=1+0|uni064E=1@-188,0+0|uni062A.medi=1+244|space=0+0]

as expected, but replacing --direction=rtl with --direction=ltr gives

[space=0+0|space=1+0|uni064E=1@-212,0+0|uni062A.init=1+190]

In particular, both space glyphs representing the ZWJs end up at the beginning, and the initial form is used.

@khaledhosny Is this supposed to happen?

khaledhosny commented 4 years ago

Yes, sort of.

HarfBuzz wants to shape scripts in their native direction. So when setting a direction other than the native direction for a script, HarfBuzz will reverse the buffer before shaping. It will also avoid breaking grapheme clusters, as one does not want, say, a mark to precede its base. ZWJ is a grapheme extender, so the first ZWJ is considered a grapheme cluster by itself (as it extends nothing) and the base+mark+ZWJ are considered another grapheme cluster.

<U+200D>,<U+062A,U+064E,U+200D>

After reversal:

<U+062A,U+064E,U+200D>,<U+200D>

After shaping the buffer will be reversed again since the native direction is RTL (a simple reversal this time with no grapheme clusters business).

U+062A,U+064E,U+200D,U+200D

After reversal:

U+200D,U+200D,U+064E,U+062A

If you set the script to latn when the direction is ltr, no reversal will happen:

$ hb-shape --direction=ltr --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=latn --unicodes=U+200D,U+062A,U+064E,U+200D
[space=0+0|uni062A=1+926|uni064E=1+0|space=1+0]

latn with rtl will do the initial reversal but not the last one:

$ hb-shape --direction=rtl --font-file "$(kpsewhich Amiri-Regular.ttf)" --script=latn --unicodes=U+200D,U+062A,U+064E,U+200D
[uni062A=1+926|uni064E=1+0|space=1+0|space=0+0]

Shaping a script in a direction other than its native direction is risky and unlikely to always give a meaningful result.
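
To make the ordering concrete, here is a rough Python sketch of the two reversals described above. It is not HarfBuzz code: the names `graphemes`, `output_order` and `NATIVE_DIRECTION` are made up for illustration, the grapheme test is deliberately simplified (nonspacing marks and ZWJ attach to whatever precedes them), and no shaping is done, so it only models the order and cluster values of the buffer, not the glyph forms or positions.

```python
import unicodedata

ZWJ = 0x200D

# Assumption: only the two scripts used in this thread are modelled.
NATIVE_DIRECTION = {"arab": "rtl", "latn": "ltr"}


def is_extender(cp):
    """Simplified test: nonspacing marks and ZWJ extend the preceding grapheme."""
    return cp == ZWJ or unicodedata.category(chr(cp)) == "Mn"


def graphemes(codepoints):
    """Group code points into graphemes; each item is (codepoint, cluster)."""
    groups = []
    for index, cp in enumerate(codepoints):
        if groups and is_extender(cp):
            # An extender joins the previous grapheme and shares its cluster value.
            cluster = groups[-1][0][1]
            groups[-1].append((cp, cluster))
        else:
            # A base, or an extender with nothing before it, starts a new grapheme.
            groups.append([(cp, index)])
    return groups


def output_order(codepoints, direction, script):
    """Return (codepoint, cluster) pairs in the order of the output buffer."""
    native = NATIVE_DIRECTION[script]
    groups = graphemes(codepoints)
    if direction != native:
        # Initial reversal: whole graphemes are swapped, never split.
        groups.reverse()
    buffer = [item for group in groups for item in group]
    if native == "rtl":
        # Final reversal: a plain item-by-item reversal.
        buffer.reverse()
    return buffer


if __name__ == "__main__":
    text = [0x200D, 0x062A, 0x064E, 0x200D]
    for script in ("arab", "latn"):
        for direction in ("rtl", "ltr"):
            order = output_order(text, direction, script)
            shown = ", ".join(f"U+{cp:04X}={cl}" for cp, cl in order)
            print(f"{script} {direction}: {shown}")
```

For the four hb-shape runs in this thread, the sketch gives the same glyph order and cluster values as the output buffers shown, for instance U+200D=0, U+200D=1, U+064E=1, U+062A=1 for arab with ltr.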

zauguin commented 4 years ago

@khaledhosny Thank you.