servo / unicode-bidi

Implementation of the Unicode Bidirection Algorithm in Rust
Other
74 stars 34 forks source link

Discrepancy compared with ICU4C bidi results #116

Closed jfkthame closed 5 months ago

jfkthame commented 7 months ago

This is not necessarily a bug, but it caught me by surprise and I'd like to raise it for consideration.

It turns out that the Levels as resolved by unicode-bidi don't always match those from ICU4C's ubidi functions, when handling directional-override and -embedding controls.

Example: passing the text 'A', RLO, 'X', LRO, 'B', to ICU4C's ubidi_setPara, and then calling ubidi_getLevels to retrieve the resulting levels array, we get [0, 1, 1, 2, 2]; i.e. the RLO control is assigned level 1, as is the direction-overridden 'X'; then the LRO control gets level 2, as does the 'B'.

But when unicode-bidi processes this text, it gives us [0, 0, 1, 1, 2] instead; i.e. the override controls keep the level of the preceding character, and the new level is only applied after the control itself. (This is most easily seen with the utf-16 API, which more closely parallels ICU4C; using the utf-8 API, the levels array will be [0, 0, 0, 0, 1, 1, 1, 1, 2] because each bidi control occupies 3 bytes.)

A similar discrepancy occurs with the embedding overrides RLE / LRE.

I don't believe this is a spec violation; https://www.unicode.org/reports/tr9/#BD11 tells us that:

once explicit directional formatting characters have been used to assign embedding levels to the characters in a paragraph, embedding initiators and PDFs are removed (or virtually removed) from the paragraph. Thus, the embedding levels assigned to the embedding initiators and PDFs themselves are irrelevant.

Neither ICU4C nor unicode-bidi actually removes the embedding controls, however, and so the caller can "see" what levels were assigned to them. And the fact that they don't match may come as an unwelcome surprise.

In Retaining BNs and Explicit Formatting Characters, the spec gives some implementation guidance (as an informative note, not a normative part of the spec); if I'm understanding it correctly, the behavior of unicode-bidi here follows from the first bullet point in the notes there. So I think unicode-bidi is more closely following the guidelines here, and ICU4C is the one that is deviating, but OTOH it is also the more long-established implementation and seems unlikely to change.

Any thoughts here? The mismatching results from APIs that look as though they should be equivalent is a tricky "gotcha" for anyone looking to migrate from ICU4C to unicode-bidi and making use of the resolved levels array.

Manishearth commented 7 months ago

Yep, I did a bunch of work to ensure unicode-bidi was matching the rules under Retaining BNs.

In general the conformance tests strongly avoid testing this, because it's not something the spec cares about. I don't think we should either. You're welcome to file an ICU4C jira ticket asking for matching the algorithm closer, but it really shouldn't matter. Anyone comparing these levels arrays should be ignoring these elements (and also ignoring non-first-byte entries!)

(closing but happy to continue discussion!)

MoSal commented 5 months ago

So, just to be clear. This is supposed to be correct?

assert!(BidiInfo::new("\u{2067}ف f", None).paragraphs[0].level.is_ltr());

An RLI can turn a paragraph from RTL to LTR! That is surely unexpected.

(Real use-case is a shaping library using this to determine span info. And a library user trying to force direction is failing because of it.)

Manishearth commented 5 months ago

I don't think that's related to this issue, but that comes from this line of the spec:

P2. In each paragraph, find the first character of type L, AL, or R while skipping over any characters between an isolate initiator and its matching PDI or, if it has no matching PDI, the end of the paragraph

The default is LTR, it's not that the RLI turns an RTL paragraph into an LTR one, it's that it hides the characters in that paragraph making those characters no longer able to affect paragraph directionality.

Manishearth commented 5 months ago

This result also matches the reference implementation which spits out a resolved paragraph level of 0 (LTR)

The library user is using this character incorrectly (or perhaps they are using it correctly and the library is making an incorrect inference based on resolved paragraph level: I can't know for sure without really knowing what's going on)

MoSal commented 5 months ago

So, I'm experimenting with cosmic-term, and the shaping library is cosmic-text.

Cosmic term sends terminal lines to the shaper like "<some text><spaces until end of terminal line>".

I made the change to prefix lines with LRI to prevent dir auto-detection, which gives you the expected behavior in terminals.

Now I wanted to add the option to change this at run-time, adding auto and rtl modes. That's when I noticed that prefixing with RLI or FSI always gives you LTR alignment.

So I'm wondering if this is a bug in cosmic-text, considering that RLI is supposed to be equivalent to dir="rtl". Or maybe it's correct behavior, and a new API that allows forcing direction just needs to be added.

Manishearth commented 5 months ago

RLI is dir=RTL for the contained text, not for the paragraph. In other words, it's the equivalent of sticking a <span dir=RTL> inside the text, not the equivalent of adding dir=RTL for the surrounding paragraph.

You should instead set the default paragraph direction when initializing the API. If the library you are using doesn't expose this, consider asking for it.

Alternatively start the text with LRM/RLM, they will have the side effect of forcing the paragraph direction when the algorithm is set to autodetecting it. I don't think they'll have any other effect as long as they are at the beginning of the string.

Manishearth commented 5 months ago

closing as not planned for now, if someone really wants parity with ICU4C for the levels of BN I'm happy to discuss use cases, but for now I'd say anyone planning to compare algorithms should ignore BNs, just like the conformance tests

MoSal commented 5 months ago

Thank you for your detailed responses despite my off-topic intrusion.