golang / go

x/text/unicode/bidi: nested isolates don't produce correct visual order #69819

Open dominikh opened 1 month ago

dominikh commented 1 month ago

Consider the following bit of HTML:

  <p dir="ltr">
    The title is <span dir="rtl">אבג <span dir="ltr">C++</span> דהו</span> in Hebrew.
  </p>

This is a Latin paragraph containing a (faux) Hebrew book title that itself contains the Latin name "C++". The title as a whole should render right-to-left, with C++ rendering left-to-right. That is, it should render like this:

[image: expected rendering]

Without the spans, i.e.

  <p dir="ltr">
    The title is אבג C++ דהו in Hebrew.
  </p>

this would render the title as three independent runs, resulting in the incorrect rendering:

[image: incorrect rendering]

The spans map directly to Right-to-Left Isolate (RLI, U+2067), Left-to-Right Isolate (LRI, U+2066), and Pop Directional Isolate (PDI, U+2069). As a Go string, this is

"The title is \u2067אבג \u2066C++\u2069 דהו\u2069 in Hebrew."

which I call the "annotated" version of the plain string

"The title is אבג C++ דהו in Hebrew."

However, when I run the following code that uses the bidi package, both the plain and the annotated string result in the same, incorrect visual order:

package main

import (
    "fmt"
    "log"

    "golang.org/x/text/unicode/bidi"
)

func main() {
    plain := "The title is אבג C++ דהו in Hebrew."
    // This uses RLI, LRI, and PDI to achieve the equivalent to
    //   The title is <span dir="rtl">אבג <span dir="ltr">C++</span> דהו</span> in Hebrew.
    annotated := "The title is \u2067אבג \u2066C++\u2069 דהו\u2069 in Hebrew."

    for _, s := range []string{plain, annotated} {
        var p bidi.Paragraph
        p.SetString(s, bidi.DefaultDirection(bidi.LeftToRight))
        ord, err := p.Order()
        if err != nil {
            log.Fatal(err)
        }
        for i := range ord.NumRuns() {
            run := ord.Run(i)
            fmt.Printf("%d %d %q\n", i, run.Direction(), run.String())
        }
        fmt.Println()
    }
}

Output:

0 0 "The title is "
1 1 "אבג"
2 0 " C++ "
3 1 "דהו"
4 0 " in Hebrew."

0 0 "The title is \u2067"
1 1 "אבג \u2066"
2 0 "C++"
3 1 "\u2069 דהו"
4 0 "\u2069 in Hebrew."

bidi.go has the following comment:

// This API tries to avoid dealing with embedding levels for now. Under the hood
// these will be computed, but the question is to which extent the user should
// know they exist. We should at some point allow the user to specify an
// embedding hierarchy, though.

but I'd still expect the computed visual order to be correct with respect to the embedding levels, even if the levels themselves aren't exposed to the user.

I've confirmed with Firefox and Chrome that my use of RLI/LRI/PDI produces the expected rendering, identical to the one using spans.

(Take special care when reading this issue in a browser that handles right-to-left text: the strings in the code samples and output will be displayed in visual order, not logical order. I've attached all code as an archive to avoid confusion. For Emacs users, (setq bidi-display-reordering nil) is a handy way of disabling reordering to be able to inspect file contents in logical order.)

bidi.tar.gz

cherrymui commented 1 month ago

cc @mpvl