lojjic / bidi-js

A pure JavaScript implementation of the Unicode Bidirectional Algorithm
MIT License

getReorderedString of mixed RTL, LTR and Emoji text sometimes differs from browser bidi behavior #9

Open tombigel opened 4 months ago

tombigel commented 4 months ago

When passing a mixed RTL, LTR and emoji string to getReorderedString with the sentence processed as RTL, an emoji that comes right after an LTR block seems to be treated as part of that LTR block, while the browser seems to treat it as a new RTL block (I'm not familiar enough with the implementation to be sure I'm using the right terms here):

If the original string is
RRR LLL N (where N is the emoji), the browser renders it (in a "direction: rtl" block) as N LLL RRR, but bidi-js returns LLL N RRR.

If the original string is
RRR LLL R N
both the browser and bidi-js render it as N R LLL RRR.

A CodePen demo: https://codepen.io/tombigel/pen/gOJYQom

Is this expected?

tombigel commented 1 month ago

I did some research (at least tried to) and got some interesting info:

What I wanted to do was add some new tests to BidiCharacterTest.txt or at least to a parallel file in the same format

I thought it's going to be easy - just grab some short strings with Emojis, run a known-to-work bidi implementation on them (I used Fribidi CLI), get the embedding levels, the visual order, and extend the tests with 4-5 simple cases.

But, as nothing is easy with bidi stuff, I stumbled:

  1. Fribidi apparently has a bug similar to the one in bidi-js: it changes the order of the variable-width part of the string and breaks it.
  2. Even though the tests' description uses the phrase "code points" and not "character codes", I couldn't find any code points wider than 16 bits in them, which made me start second-guessing whether I'm using the correct method...
    For example, using JS codePointAt() I got this: "0043 05d0 1f471 1f3fd 200d 2642 fe0f" for "Cא👱🏽‍♂️", but there is nothing like this in any of the Unicode bidi tests.
  3. I tried to dig through the Unicode bidi spec (versions 13, 14 and 15), looking for any comment about "variable width" or "multibyte", but either there is nothing, or the lingo is not the one I was looking for, or I just don't understand it well enough.
  4. I tried to build and use pybidi, but it doesn't have any embedding-levels API, just string-in / string-out, and it also garbled my string, so it didn't help.
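
For item 2, a small sketch of where the seven values come from, using only standard JavaScript string methods. The string is the same example written with escapes; the variable names are just for illustration. Iterating a string walks code points, while indexing it walks UTF-16 code units, so the two emoji above U+FFFF each take two string indices:

```javascript
// The example string from item 2, written with escapes so every code point is visible.
const text = "C\u05D0\u{1F471}\u{1F3FD}\u200D\u2642\uFE0F";

const hex = (n) => n.toString(16).padStart(4, "0");

// Iterating a string (Array.from, for...of, spread) walks whole code points.
const codePoints = Array.from(text, (ch) => hex(ch.codePointAt(0)));

// Indexing a string (charCodeAt, text[i], text.length) walks UTF-16 code units.
const codeUnits = [];
for (let i = 0; i < text.length; i++) {
  codeUnits.push(hex(text.charCodeAt(i)));
}

console.log(codePoints.join(" ")); // 0043 05d0 1f471 1f3fd 200d 2642 fe0f
console.log(codeUnits.join(" "));  // 0043 05d0 d83d dc71 d83c dffd 200d 2642 fe0f
console.log(codePoints.length, text.length); // 7 9
```

So the same text is 7 items long when counted in code points and 9 when counted in JS string indices.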

Some technical details:

Fribidi CLI output format:

<string>
Base direction: <basedir> 
<ltov> 
<vtol>
<levels>

Here are the fribidi results with the text "Cא👱🏽‍♂️" :

> fribidi bidi2.txt -v
Cא👱🏽‍♂️
Base direction: L
0 1 2 3 4 5 6
0 1 2 3 4 5 6
0 1 0 0 0 0 0
> fribidi bidi2.txt -v --rtl
 ♂️‍🏽👱אC
Base direction: R
6 5 4 3 2 0 1
5 6 4 3 2 1 0
2 1 1 1 1 1 1

So, what should the visual order be? 2 3 4 5 6 0 1 ?
And the logical? 2 3 4 5 6 1 0 ?
And I understand the levels output of the LTR case, but not of the RTL one.
It looks like it marked the entire emoji sequence as RTL.
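
As an aid for reading the two index rows of the --rtl run, here is a minimal sketch that assumes they are the logical-to-visual and visual-to-logical maps named in the format description above. If so, each row should be the inverse permutation of the other. The `invert` helper is hypothetical and not part of Fribidi:

```javascript
// Index rows copied from the --rtl run above.
const ltov = [6, 5, 4, 3, 2, 0, 1]; // logical index -> visual position
const vtol = [5, 6, 4, 3, 2, 1, 0]; // visual position -> logical index

// Hypothetical helper: invert a permutation given as an array of indices.
function invert(perm) {
  const inverse = new Array(perm.length);
  perm.forEach((target, source) => {
    inverse[target] = source;
  });
  return inverse;
}

console.log(invert(ltov).join(" ")); // 5 6 4 3 2 1 0
```

The inverted row matches the vtol row. Each row has 7 entries, so Fribidi is counting code points here rather than JS string indices. Read as visual-to-logical, logical items 5 and 6 sit at visual positions 0 and 1 in their logical order, while everything else is simply reversed.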

For the LTR case bidi-js returns the same levels as Fribidi, but for the RTL case it returns 2 1 2 2 2 2 1 1 1 and the string "️♂‍👱🏽אC", so it seems to do at least part of the job correctly, but it misses the variable-width boundaries?
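
To compare the two RTL results index by index, the 7 Fribidi levels first have to be stretched to the 9 JS string indices that bidi-js reports. A minimal sketch, assuming Fribidi's levels are per code point; `expandLevelsToCodeUnits` is a hypothetical helper, not a bidi-js or Fribidi function:

```javascript
const text = "C\u05D0\u{1F471}\u{1F3FD}\u200D\u2642\uFE0F";

// Hypothetical helper: repeat each per-code-point level once per UTF-16 code unit.
function expandLevelsToCodeUnits(str, levelsPerCodePoint) {
  const expanded = [];
  let cp = 0;
  for (const ch of str) { // for...of yields whole code points
    for (let i = 0; i < ch.length; i++) {
      expanded.push(levelsPerCodePoint[cp]);
    }
    cp++;
  }
  return expanded;
}

const fribidiRtl = [2, 1, 1, 1, 1, 1, 1];      // levels from the --rtl run above
const bidiJsRtl = [2, 1, 2, 2, 2, 2, 1, 1, 1]; // levels reported for bidi-js above

const expanded = expandLevelsToCodeUnits(text, fribidiRtl);
const differing = expanded
  .map((level, i) => (level !== bidiJsRtl[i] ? i : -1))
  .filter((i) => i >= 0);

console.log(expanded.join(" "));  // 2 1 1 1 1 1 1 1 1
console.log(differing.join(" ")); // 2 3 4 5
```

The only indices that differ are 2 to 5, which are exactly the four code units of the two surrogate pairs (👱 and 🏽).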

That's my 2 cents research anyway

lojjic commented 1 month ago

Thanks Tom for the helpful analysis!

I believe there are two issues at play:

1) Bidi.js is not properly handling surrogate pairs; it's operating on individual JS string indices (UTF-16 code units) rather than on code points. This is, I believe, causing the embedding levels to be incorrect for these characters. This should be a fairly easy fix: update all string iteration to operate on code points rather than JS string chars, and ensure the end result still lines up with the JS string indices.

2) Once that's fixed, we still have the issue of how to handle the visual/logical reordering based on those corrected embedding levels. I found this in the Bidi spec section L3:

"Combining marks applied to a right-to-left base character will at this point precede their base character. If the rendering engine expects them to follow the base characters in the final display process, then the ordering of the marks and the base character must be reversed."

While this refers specifically to combining marks (e.g. zero width joiner \u200d, which does appear in your example string), I think it would also apply to emoji variation selectors like \ufe0f. And maybe to emoji sequences in general?

This un-flipping would be done as a special case in getReorderedIndices, not before.
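
A rough sketch of what that un-flipping could look like, working on code point indices. This is not the actual getReorderedIndices in bidi-js: `reorderByLevels` and `unflipClusters` are hypothetical helpers, the levels are the per-code-point RTL levels from the Fribidi run earlier in the thread, and the emoji cluster boundary (logical indices 2 to 6) is supplied by hand rather than detected. Whether the spec really intends this treatment for whole emoji sequences is still the open question.

```javascript
// Rule L2: from the highest level down to the lowest odd level, reverse every
// contiguous run at that level or higher. Returns logical indices in visual order.
function reorderByLevels(levels) {
  const order = levels.map((_, i) => i);
  const oddLevels = levels.filter((l) => l % 2 === 1);
  if (oddLevels.length === 0) return order;
  const highest = Math.max(...levels);
  const lowestOdd = Math.min(...oddLevels);
  for (let lvl = highest; lvl >= lowestOdd; lvl--) {
    let i = 0;
    while (i < order.length) {
      if (levels[order[i]] < lvl) {
        i++;
        continue;
      }
      let j = i;
      while (j + 1 < order.length && levels[order[j + 1]] >= lvl) j++;
      // Reverse order[i..j] in place.
      for (let a = i, b = j; a < b; a++, b--) {
        [order[a], order[b]] = [order[b], order[a]];
      }
      i = j + 1;
    }
  }
  return order;
}

// Special case after L2: put each cluster's members back into logical order.
// Assumes every cluster [start, end) lies inside a single level run, so its
// members occupy adjacent visual positions.
function unflipClusters(visualOrder, clusters) {
  const result = visualOrder.slice();
  for (const [start, end] of clusters) {
    const positions = [];
    result.forEach((logical, pos) => {
      if (logical >= start && logical < end) positions.push(pos);
    });
    const ascending = positions.map((pos) => result[pos]).sort((a, b) => a - b);
    positions.forEach((pos, k) => {
      result[pos] = ascending[k];
    });
  }
  return result;
}

const rtlLevels = [2, 1, 1, 1, 1, 1, 1]; // per code point, from the Fribidi --rtl run
const emojiClusters = [[2, 7]];          // hand-supplied: code points 2..6 form the emoji

const plain = reorderByLevels(rtlLevels);
const unflipped = unflipClusters(plain, emojiClusters);

console.log(plain.join(" "));     // 6 5 4 3 2 1 0
console.log(unflipped.join(" ")); // 2 3 4 5 6 1 0
```

With the cluster kept in logical order, the result is the 2 3 4 5 6 1 0 ordering guessed at earlier in the thread. A real implementation would also need to map these code point indices back to JS string indices.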