latex3 / babel

The babel system for LaTeX, LuaLaTeX and XeLaTeX
LaTeX Project Public License v1.3c

Arabic diacritics displaced by kashida.plain transforms #257

Open lueck opened 10 months ago

lueck commented 10 months ago

Thanks for your great effort on the kashida feature! I know that it's still experimental. I'd like to point out some issues.

Here's an MWE with an analysis tool for inspecting the input characters (to help people like me who have to typeset Arabic but can't read it). It uses \makebox[LENGTH][s]{TESTCASE} to force kashida elongation on single words for demonstration.

\documentclass{book}

\usepackage{luabidi}
\setRTLmain

\usepackage[ngerman,english,bidi=basic]{babel}[2021/05/16]% version 3.59 or later

\babelprovide[import,main,%
justification=kashida,%
transforms=kashida.plain%
]{arabic}

\babelfont{rm}[Scale=3]{FreeSerif} % {ArabicTypesetting} %

\usepackage{luacode}

\begin{filecontents*}[overwrite]{analyzestring.lua}
-- for a given string make a table about the characters it contains
function analyzestring(s)
   tex.print("\\begin{tabular}{rrrrc}\\\\")
   tex.sprint("bytes", "&unicode10", "&unicode16", "&utf8", "&char", "\\\\\\hline")
   for p, c in utf8.codes(s) do
      -- get the UTF8 byte representation
      if (c < 0x80) then
         byt = string.format("0x%02x", string.byte(utf8.char(c), 1))
         position = string.format("%d", p)
      elseif (c < 0x800) then
         byt = string.format("0x%04x", string.byte(utf8.char(c), 1) * 0x100 + string.byte(utf8.char(c), 2))
         position = string.format("%d..%d", p+1, p)
      elseif (c < 0x10000) then
         byt = string.format("0x%06x", string.byte(utf8.char(c), 1) * 0x10000 + string.byte(utf8.char(c), 2) * 0x100 + string.byte(utf8.char(c), 3))
         position = string.format("%d..%d", p+2, p)
      else
         byt = string.format("0x%08x", string.byte(utf8.char(c), 1) * 0x1000000 + string.byte(utf8.char(c), 2) * 0x10000 + string.byte(utf8.char(c), 3) * 0x100 + string.byte(utf8.char(c), 4))
         position = string.format("%d..%d", p+3, p)
      end
      tex.sprint(position, "&",
                 c, "&",
                 string.format("U+%04x", c), "&",
                 byt, "&",
                 utf8.char(c)
                 )
      tex.print("\\\\")
   end
   tex.print("\\end{tabular}")
end
\end{filecontents*}
\directlua{require "analyzestring.lua"}

% output a test case with \case{NUMBER}{WORD}{EXPECTATION}{DESCRIPTION}
\newcommand*{\case}[4]{%
  \noindent #1 %
  \directlua{Babel.arabic.justify_enabled=false}%
  #2 %
  -- #3 %
  \directlua{Babel.arabic.justify_enabled=true}%
  \hfill%
  \fbox{\makebox[5em][s]{#2}}%
  % table about the characters in the test case
  \\{\LTR\tiny%
    #4\\
    \directlua{analyzestring("\luaescapestring{#2}")}%
  }%
  \vskip 10mm%
}

\begin{document}

\case{1}{تَثَنَّى}{تَـثَـنَّى}{Kashidas should be inserted \emph{after} non-spacing marks like ARABIC FATHA, U+064e.}

\case{2}{تَـثَـنَّى}{تَــثَــنَّى}{Existing Kashidas should be further elongated.}

\case{3}{تَــــــثَـنَّى}{تَــثَــنَّى}{Existing Kashidas should be homogenized.}

\case{4}{بِأَبي}{بِـأَبي}{There should be no Kashida at end. But for Arabic Typesetting, there is.}

\end{document}

TEX engine: LuaHBTeX, Version 1.17.0 (TeX Live 2023)

babel version: 2023/08/09 v3.92.22182 The Babel package (from github)

[Image: false-kashida-0]

The output per test case is (from right to left): number, input, expectation, result (box).

  1. In test case 1 you can see that the diacritics (vowels) are displaced horizontally from the letters (consonants) by kashidas. As far as I know, the FATHA (U+064e) should stay above the consonant instead of being shifted to the left. The kashida should be inserted after all the diacritics that belong to a consonant.

I tried to fix this by changing

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[]*[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

to

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

where the second () is moved behind the regex for the diacritics []*. But this makes the diacritic disappear when a kashida is inserted behind the consonant the FATHA refers to.

I also tried special rules for consonant+vowel combinations like

kashida.plain.2.0 = { ()ثَ()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

But again, the effect is that the FATHA disappears. So I guess we need 2-letter and 3-letter rules to get this right, something like the rules below, but I don't know the syntax for 2- and 3-letter rules.

; 3-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][][]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
; 2-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب][]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
; 1-letter
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
  2. In test case 2 you can see that kashidas that already exist in the input prevent further elongation. I think this is like #243.

This can be fixed by adding the kashida into the first regex character class:

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثبـ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
  3. If you want to make the kashida insertion homogeneous, as @amarakon would like to see in #243, we could drop the existing kashidas in a 1-letter rule (the same way that makes the diacritic go away in my attempts):
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثب][ـ]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }

Or try this, which makes diacritics go away (too bad!) and kashidas homogeneous:

kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثب][]*[ـ]*()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
  4. For the font ArabicTypesetting, I get kashidas at the end of a word for some letters.

Could you point me to documentation of the transformation rules?

jbezos commented 10 months ago

I'm facing several technical issues/limitations and I'm a bit stuck. See for example https://tex.stackexchange.com/questions/686767/process-hbox-with-luaotfload and also https://github.com/harfbuzz/harfbuzz/pull/3762#issuecomment-1531726473. Some others are related to the fonts, which sometimes don't seem to take into account the kashida (with clearly misplaced diacritics).

I’ll read your looong report carefully (I wish they were all like that 🙂). There are some explanations here:

The horizontal placement of diacritics is under the direct control of babel, and I was working on an option to set it (start, center, end).

lueck commented 10 months ago

https://latex3.github.io/babel/guides/non-standard-hyphenation-with-luatex.html

Thanks! That enables me to make more informed experiments.

With the following transformation rules, the horizontal displacement of diacritics is solved using 1-letter rules:

; insert kashida into pattern with certain consonant combinations
kashida.plain.1.0 = { ()[يئهشسقفغعضصنمكلظطخحجثتب]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.1.1 =   { kashida = 500 }
; one diacritic mark: insert kashida behind it
kashida.plain.2.0 = { [يئهشسقفغعضصنمكلظطخحجثتب]()[ًٍَُِّ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.2.1 =   { kashida = 500 }
; two diacritic marks: insert kashida behind them
kashida.plain.3.0 = { [يئهشسقفغعضصنمكلظطخحجثتب][ًٍَُِّ]()[ًٍَُِّ]()[يئهشسقفغعضصنمكلظطخحجثتباأإآوؤذدزرة] }
kashida.plain.3.1 =   { kashida = 500 }
kashida.plain.4.0 = { ()ل()[ًٍَُِّ]*[اأإآ] }
kashida.plain.4.1 =   { kashida = 0 }

But in the output, the kashida is displaced vertically:

[Image: case1]

lueck commented 10 months ago

So the y-axis offset that results from lifting diacritics should be reset before inserting the kashida (and maybe restored afterwards). I can't guarantee that this is a TeX-like formulation of a fix...
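The idea of resetting the offsets can be sketched roughly as follows. This is an illustration only, not babel's code: it assumes the kashida glyph is created by copying a neighbouring glyph node (which would also copy that node's offsets), and the helper name `new_kashida_node` is hypothetical. It relies only on LuaTeX's `node.copy` and the `char`, `xoffset` and `yoffset` fields of glyph nodes, and should be compiled with LuaLaTeX.

```latex
\documentclass{article}
\usepackage{luacode}
\begin{luacode*}
-- Hypothetical helper (not part of babel): build a tatweel glyph
-- from an existing glyph node and clear the offsets it inherited,
-- e.g. from a diacritic that the font has lifted.
function new_kashida_node(template)
  local k = node.copy(template)  -- copies font, attributes and offsets
  k.char = 0x0640                -- U+0640 ARABIC TATWEEL
  k.xoffset = 0                  -- reset horizontal shift
  k.yoffset = 0                  -- reset vertical shift
  return k
end
\end{luacode*}
\begin{document}
The helper is only defined here, not called.
\end{document}
```

Hooking such a helper into the justification step is left out on purpose, since that depends on babel internals and on what the font loader puts into the node list.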

lueck commented 10 months ago

With the changes from my kashida-after-diacritics branch, I now get a result for my case 1, which I am happy with:

[Image: case1-fixed-fs]

If you would rather keep the kashida.plain transform as it is, I would suggest making this an alternative transform called kashida.after.diacritics.

Should I open a PR?

lueck commented 10 months ago

Hm, with other fonts I still get bad results where the kashida is shifted above the baseline for some character combinations.

jbezos commented 10 months ago

I'm somewhat busy right now. Allow me a week or so.

lueck commented 10 months ago

@jbezos No problem! Sorry for mixing in #243 and writing such a cumulative issue. Also, my \case{4}... should be a separate issue, see #258.

I managed to get very fine results in the meantime.

In order to leave kashida.plain as it is, I made another branch where I added justification rules named kashida.afterdiacritics.plain. I also squashed my suggested changes to babel.dtx into one commit in order to make it more comprehensible.

By default, the logic of kashida insertion is unchanged. Only with \directlua{Babel.arabic.kashida_after_diacritics = true} the creation of the node for a kashida is changed, so that it can be placed correctly.

This is the result I get with Babel.arabic.kashida_after_diacritics = false:

[Image: displaced]

And this is the result I get with Babel.arabic.kashida_after_diacritics = true. If you look very carefully, you'll notice that the kashida is not always at the same y offset, which is a feature of this font.

[Image: corrected]

New cases 5 and 6:

\case{5}{للشُهْبِ}{لـلــشُـهْبِ}{Kashida 3 displaced}

\case{6}{تَأَصَّلَ}{تَـأَصَّـلَ}{Kashidas with bad x and y offset.}

jbezos commented 10 months ago

> With the changes from my kashida-after-diacritics branch,

Thanks. I’m reading the code and there is a point that can be misleading and should be clarified. It isn’t FreeSerif that uses the PUA, but luaotfload, mainly as a trick to access glyphs without a Unicode code point. Relying on what luaotfload does internally isn’t safe. The problem is that in the justification step, the node list often contains these PUA codes, the exact meaning of which is often unknown. This is one of the technical issues/limitations I was talking about.

jbezos commented 10 months ago

I’ll work on some of your ideas. The new transform can be useful in ‘plain’ fonts, not involving ligatures, but with the latter it’s still an unsolved issue, except by creating rules specific to a font. For the JALT table I devised a hack based on parsing some frequent cases twice, with the normal form and the elongated one, but it’s basically a proof of concept that can’t go very far (and it only works with Sakkal Majalla, and not quite – again diacritics are the problem).

The vertical positioning of tashkil is not (usually) fixed, and they are shifted by the font depending on the character. I was working on something similar to the JALT variants to catch the correct yoffset (and xoffset, actually) with kashida, but it seems some (many) fonts don’t bother to deal with kashida and they are clearly misplaced (kasrah is usually too low).

jbezos commented 9 months ago

Your transform is now available (in version 3.94), under the name kashida.base:

https://latex3.github.io/babel/news/whats-new-in-babel-3.94.html#new-transform-for-kashida
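
For readers who want to try it, a minimal sketch adapted from the MWE at the top of this issue might look like the following. It assumes babel 3.94 or later and that kashida.base is selected through the same transforms= key as kashida.plain; the news page linked above is the authoritative description. Compile with LuaLaTeX.

```latex
% Sketch only: assumes babel 3.94+ and that kashida.base is loaded
% via transforms= in the same way as kashida.plain in the MWE above.
\documentclass{article}
\usepackage[english,bidi=basic]{babel}
\babelprovide[import,main,%
justification=kashida,%
transforms=kashida.base%
]{arabic}
\babelfont{rm}[Scale=3]{FreeSerif}
\begin{document}
% Same trick as in the MWE: a stretched box forces elongation
% of a single word (test case 1 from this issue).
\noindent\fbox{\makebox[5em][s]{تَثَنَّى}}
\end{document}
```

As discussed in the thread, the result depends on the font, so other fonts may still show kashidas or diacritics at unexpected offsets.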