servo / unicode-bidi

Implementation of the Unicode Bidirectional Algorithm in Rust

Point/period handling in a mixed Latin/Arabic text #52

Closed RazrFalcon closed 3 years ago

RazrFalcon commented 5 years ago

I have a string that looks like this اقرأ المزيد عن SVG 1.1 SE أيضًا.. Applications that support BIDI will put . at the end of the line. The problem is that I'm shaping the BidiInfo::visual_runs output via harfbuzz and since unicode-bidi keeps the . at the end, harfbuzz with the RTL flag puts it in the front. So the string ends up looking like this اقرأ المزيد عن SVG 1.1 SE .أيضًا.

Is this a unicode-bidi issue or am I using harfbuzz incorrectly?

I'm not familiar with bidi algorithms and text shaping, so maybe I'm missing something obvious.

behnam commented 5 years ago

Afaik, harfbuzz doesn't perform any reordering.

Now, on the bidi resolution layer: the position of a period (logically at the end of a sentence) depends on the "base direction" of the text block. If you set that to LTR, you will get the period on the right end. If the base direction is RTL, the period ends up on the left.

So maybe you need to adjust the base direction here?

RazrFalcon commented 5 years ago

Afaik, harfbuzz doesn't perform any reordering.

Yes. That's why I'm using BidiInfo::visual_runs.

So maybe you need to adjust the base direction here?

I'm using RTL for the last block, aka أيضًا.. If I set LTR here, a period will be on the right end, but glyphs will be messed up.

Here are the raw chars of the last block:

' '
'أ'
'ي'
'ض'
'\u{64b}'
'ا'
'.'

If I swap ' ' and '.', everything works as expected.

mbrubeck commented 5 years ago

What are you passing as the base level for the paragraph (the second argument to BidiInfo::new)? Do you get your expected result if you pass Some(Level::ltr())?

behnam commented 5 years ago

@RazrFalcon Something to remember is that you need to perform bidi resolution on the whole text block (paragraph), not line-by-line.

Assuming that you do that, having Level::rtl() as the base direction, you should get this visual ordering for the last line:

  1. '.'
  2. 'ا'
  3. '\u{64b}'
  4. 'ض'
  5. 'ي'
  6. 'أ'
  7. ' '

You should be able to verify that with a println!().

Then, if you have problems with getting that ordering to give you the correct rendering using harfbuzz, you need to look at what's going on there.

Also, could you verify that the period is in the right place (end of the sentence, logically) in the input string?

RazrFalcon commented 5 years ago

@mbrubeck Thanks! Passing Some(Level::ltr()) fixed the problem. No idea why.

RazrFalcon commented 5 years ago

@behnam

Assuming that you do that, having Level::rtl() as the base direction

I thought that the base direction was automatic. I think we need better documentation for default_para_level. At least for newbies like me.

Also, could you verify that the period is in the right place (end of the sentence, logically) in the input string?

I got this string from Google Translate...

AFAIU, the BIDI resolution algorithm can produce different results depending on the base direction, and I didn't know that. In my case I needed the LTR direction.

khaledhosny commented 5 years ago

HarfBuzz does not reorder the runs, but it takes RTL in logical order (same as LTR text) and outputs glyphs in visual order, so you need to reorder the runs but not the individual characters inside each run.

RazrFalcon commented 5 years ago

@khaledhosny Do I need to reorder them when I'm using hb_buffer_set_direction?

khaledhosny commented 5 years ago

No, you should keep the text of each run in logical order.

mbrubeck commented 3 years ago

To summarize, the position of the period depends on the "base direction" or paragraph level. The period should appear at the right end of the string only if the base direction is LTR. If this is the desired behavior, you can set the base direction to LTR explicitly as suggested above.

If you do not pass an explicit base direction, the base direction is set automatically based on the first letter of the paragraph. The default base direction for the test case above is RTL because the first letter of the paragraph has the Bidirectional Character Type AL (Arabic Letter).
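
To make the automatic rule concrete, here is a deliberately simplified, std-only sketch of a first-strong scan. It is not how unicode-bidi implements it: `BaseDirection`, `strong_direction`, and `first_strong_direction` are hypothetical names, the character table covers only ASCII letters plus the basic Hebrew and Arabic letter ranges, and the isolate handling that the full algorithm performs is left out.

```rust
/// Paragraph direction chosen by a first-strong scan.
/// (Hypothetical type for illustration only.)
#[derive(Debug, PartialEq, Clone, Copy)]
pub enum BaseDirection {
    Ltr,
    Rtl,
}

/// Very rough strong-character test. Assumption: only a tiny subset of the
/// Unicode bidi classes is modelled here.
fn strong_direction(c: char) -> Option<BaseDirection> {
    match c {
        // ASCII letters have class L.
        'A'..='Z' | 'a'..='z' => Some(BaseDirection::Ltr),
        // Hebrew letters have class R.
        '\u{05D0}'..='\u{05EA}' => Some(BaseDirection::Rtl),
        // Basic Arabic letters have class AL.
        '\u{0620}'..='\u{064A}' => Some(BaseDirection::Rtl),
        // Digits, spaces, punctuation, and combining marks are not strong.
        _ => None,
    }
}

/// The first strong character decides the direction; a paragraph with no
/// strong character falls back to LTR.
pub fn first_strong_direction(paragraph: &str) -> BaseDirection {
    paragraph
        .chars()
        .find_map(strong_direction)
        .unwrap_or(BaseDirection::Ltr)
}
```

Under these assumptions, the test string above resolves to `Rtl` because its first letter is Arabic, while a paragraph starting with "SVG" resolves to `Ltr`. Leading digits or punctuation are skipped, since they are not strong characters.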