latex3 / babel

The multilingual framework to localize LaTeX, LuaLaTeX and XeLaTeX
https://latex3.github.io/babel/
LaTeX Project Public License v1.3c

Automatic switching of language/script issue with combining marks #112

Closed (lyndondrake closed this issue 1 year ago)

lyndondrake commented 3 years ago

The automatic language switching feature is awesome. With Hebrew, though, I've run into a small problem to do with multiple combining marks. The following works fine, although the paragraph is still set as an LTR paragraph:

שיר המעלות לשלמה אם יהוה לא יבנה בית שוא עמלו בוניו בו אם יהוה לא ישמר עיר שוא
שקד שומר׃

But this fails:

1 ‏בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ 2 ‏וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ
אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם 3 ‏וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר 4 ‏וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר
כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ (Gen 1:1-4 BHS-T)

The more verbose form works perfectly, but is much harder to read:

\begin{otherlanguage}{hebrew}
\foreignlanguage{british}{[1]} ‏בְּרֵאשִׁ֖ית בָּרָ֣א אֱלֹהִ֑ים אֵ֥ת הַשָּׁמַ֖יִם וְאֵ֥ת הָאָֽרֶץ׃‎ \foreignlanguage{british}{[2]} ‏וְהָאָ֗רֶץ הָיְתָ֥ה תֹ֨הוּ֙ וָבֹ֔הוּ וְחֹ֖שֶׁךְ עַל־פְּנֵ֣י תְה֑וֹם וְר֣וּחַ אֱלֹהִ֔ים מְרַחֶ֖פֶת עַל־פְּנֵ֥י הַמָּֽיִם׃‎ \foreignlanguage{british}{[3]} ‏וַיֹּ֥אמֶר אֱלֹהִ֖ים יְהִ֣י א֑וֹר וַֽיְהִי־אֽוֹר׃‎ \foreignlanguage{british}{[4]} ‏וַיַּ֧רְא אֱלֹהִ֛ים אֶת־הָא֖וֹר כִּי־ט֑וֹב וַיַּבְדֵּ֣ל אֱלֹהִ֔ים בֵּ֥ין הָא֖וֹר וּבֵ֥ין הַחֹֽשֶׁךְ׃ \foreignlanguage{british}{(Gen 1:1-4 BHS-T)}
\end{otherlanguage}

A number of other applications can set the second example directly (e.g. Mellel, even Emacs strangely enough!).

I'm guessing it has something to do with how code points are detected: perhaps the combining marks are not recognised as marking the combined character as Hebrew?

Very happy to test this.

Also (this may need to be a separate issue), there's a bidi algorithm which e.g. Emacs uses to determine whether a paragraph in an LTR document is RTL. With it, that second example gets marked as an RTL paragraph. I'm assuming I could do something similar with a Unicode RTL marker, perhaps?

jbezos commented 3 years ago

Works for me, with FreeSerif (although this font misplaces some cantillation marks) and Arial. Please provide a minimal example. Which version are you using?

As to auto-detecting the paragraph direction, it's not usually a good idea when there is explicit markup, only with plain text (as in Emacs, whose bidi algorithm I studied for babel). Not that I reject the idea, but it's not trivial. See, for example, Additional Requirements for Bidi in HTML & CSS.

lyndondrake commented 3 years ago

Sorry about that - turns out if I copy and paste from Logos, it works. Copy/paste from Accordance doesn't. So there must be some extra invisible characters in the Accordance export :-( Apologies for the non-issue, but your confirmation that it worked for you made me try something different. I'm super impressed with the overall workability of the automatic switching!

I'm generally working from a plain text file (either org-mode or Pandoc-flavoured Markdown). The reason is that I find it easier to write like that. With Pandoc, I can put a <div lang="he"> around the Hebrew paragraphs and they get transformed nicely.

For org-mode, it's not as obvious how to go about it. I can probably just put the LaTeX environment in and it should be carried over in the LaTeX export.

I don't know how to check the babel version, and I've attached my not-entirely-minimal test file (you could drop my font out). lualatex-hebrew-test.tex.txt lualatex-hebrew-test.pdf
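
For reference, one common way to find out which babel version is loaded is the standard `\listfiles` command, which makes LaTeX print the date and version of every loaded file at the end of the log. A minimal sketch (the document body is only a placeholder):

```latex
% \listfiles must come before \documentclass so that every file is recorded.
\listfiles
\documentclass{article}
\usepackage[british]{babel}
\begin{document}
Placeholder text.
\end{document}
```

After compiling, the `.log` file ends with a `*File List*` section; the line for `babel.sty` shows its release date and version number.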

lyndondrake commented 3 years ago

I did just note that I have bidi=basic in my babel load. Is that what I should be using?
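
For context, a minimal sketch of a preamble that combines `bidi=basic` with automatic switching by character. It assumes LuaLaTeX (as far as the babel manual describes it, `bidi=basic` is a LuaTeX-only option) and uses FreeSerif only because it is the font mentioned earlier in this thread; it is not the preamble from the attached test file.

```latex
% Compile with LuaLaTeX.
\documentclass{article}
\usepackage[bidi=basic, british]{babel}
% Load the Hebrew locale and switch to it automatically
% whenever characters from the Hebrew script appear.
\babelprovide[import, onchar=ids fonts]{hebrew}
% Assumption: FreeSerif is installed; any font covering both scripts will do.
\babelfont{rm}{FreeSerif}
\begin{document}
English text, then שיר המעלות לשלמה, then English again.
\end{document}
```

With this setup the paragraph itself is still set left to right; only the Hebrew run inside it is reordered, which matches the behaviour described in the first example above.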

lyndondrake commented 3 years ago

And having read through that document, could there potentially be an option provided to babel that (effectively) sets the equivalent of dir="auto" for paragraphs that don't have an explicit language environment? That way the default behaviour could be left alone, and lazy typists like me could try to avoid marking paragraphs directly.

jbezos commented 3 years ago

You are welcome. I'm not sure if it's a task for babel or the converter from Org or Markdown, but it's worth investigating (not in the short term, I'm afraid).

jbezos commented 1 year ago

The W3C still discourages the use of dir="auto", which is left as a last resort. With LaTeX we must know beforehand how to deal with boxes and the like, and once the node list has been created it cannot be “reversed”, which means the document would have to be preprocessed before typesetting. This is outside the scope of babel.