golang / go

The Go programming language
https://go.dev
BSD 3-Clause "New" or "Revised" License
122.22k stars 17.47k forks source link

x/text/language: Match does not work as expected for artificial languages #45749

Open bep opened 3 years ago

bep commented 3 years ago
golang.org/x/text@v0.3.3 golang.org/x/tools@v0.0.0-20180917221912-90fa682c2a6e

The following test failes with got art-x-klingon; want art-x-elvish.

func TestMatchArtificialLanguages(t *testing.T) {
    // See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
    klingon, err := language.Parse("art-x-klingon")
    if err != nil {
        t.Fatal(err)
    }
    elvish, err := language.Parse("art-x-elvish")
    if err != nil {
        t.Fatal(err)
    }

    matcher := language.NewMatcher([]language.Tag{klingon, elvish})

    m, _, _ := matcher.Match(elvish)

    if m.String() != "art-x-elvish" {
        t.Errorf("got %s; want art-x-elvish", m.String())
    }
}

https://github.com/nicksnyder/go-i18n/issues/252

nicksnyder commented 3 years ago

@bep Does this diff change the results?

func TestMatchArtificialLanguages(t *testing.T) {
    // See https://en.wikipedia.org/wiki/Codes_for_constructed_languages
    klingon, err := language.Parse("art-x-klingon")
    if err != nil {
        t.Fatal(err)
    }
    elvish, err := language.Parse("art-x-elvish")
    if err != nil {
        t.Fatal(err)
    }

+   tags := []language.Tag{klingon, elvish}
-   matcher := language.NewMatcher([]language.Tag{klingon, elvish})
+   matcher := language.NewMatcher(tags)

-   m, _, _ := matcher.Match(elvish)
+   _, i, _ := matcher.Match(elvish)
+   m := tags[i]

    if m.String() != "art-x-elvish" {
        t.Errorf("got %s; want art-x-elvish", m.String())
    }
}

(I am curious if this is related to https://github.com/golang/go/issues/24211#issuecomment-378479543)

bep commented 3 years ago

@bep Does this diff change the results?

No. It sill returns the first index.

cherrymui commented 3 years ago

cc @mpvl

ameowlia commented 2 years ago

๐Ÿ‘‹ hello, I'm no expert, but I've been poking around in /x/text recently, so I thought I would look into this. I think the root cause of this is that related to the private tags.

๐Ÿท๏ธ Go is not sorting properly based on private tags.

Here is an example where non-artificial languages have the same issue. https://play.golang.org/p/ZtyNz_CbSgO

    enA, _ := language.Parse("en-x-a")
    enB, _ := language.Parse("en-x-b")
    matcher := language.NewMatcher([]language.Tag{enA, enB})

    m, _, _ := matcher.Match(enB)
    if m.String() != "en-x-b" {
        fmt.Printf("got %s; want en-x-b\n", m.String())
    }

This returns: got en-x-a; want en-x-b

โœ๏ธ Go should sort based on private tags According to rfc4647 "Matching of Language Tags"

In the lookup scheme, the language range is progressively truncated from the end until a matching language tag is located. Single letter or digit subtags (including both the letter 'x', which introduces private-use sequences, and the subtags that introduce extensions) are removed at the same time as their closest trailing subtag. For example, starting with the range "zh-Hant-CN-x-private1-private2" (Chinese, Traditional script, China, two private-use tags) the lookup progressively searches for content as shown below:

Example of a Lookup Fallback Pattern

Range to match: zh-Hant-CN-x-private1-private2

  1. zh-Hant-CN-x-private1-private2
  2. zh-Hant-CN-x-private1
  3. zh-Hant-CN
  4. zh-Hant
  5. zh
  6. (default)

๐Ÿ’ป The code looks like it inspects private tags... but it doesn't

โœ๏ธ I'm working on a fix to both return the private use tags properly and lower the confidence level when they do not match, which should fix this issue.

gopherbot commented 2 years ago

Change https://golang.org/cl/364855 mentions this issue: language/match: match based on all subtags

mpvl commented 2 years ago

The use of private use tags is more or less deprecated. The official tag for Klingon is tlh and for Elvish, if Iโ€™m not mistaken, qya. I can imagine preprocessors to normalize user input tags, but I donโ€™t think this normalization should be applied to tags passed to the normalizer by the developer, for instance.

The marcher does not, nor does it aim to, implement RFC 4647. Instead, the algorithm is an enhancement of a matcher used internally at google, which is generally believed to give more reliable results.

I no longer work at Google, but I doubt the move away from -x support will have reverted.

mpvl commented 2 years ago

Note, the compliance test set is in language/testdata. The CLDR test set is the golden standard.

The two are mostly the same, but there are some notable differences. The Go algorithm uses rule based matching, instead of the scoring algorithm used by Google's algorithm. This has the advantage that it 1) preserves local information across matches (this is the whole point of these BCP 47 tags, so I wouldn't be surprised if other implementations are catching up to this), 2) observes user preferences where a language is preferred interstitially to two dialects of another language (for instance, for supported languages en-GB, nl and a user preference of en, nl, en-GB it matches nl, not en-GB, 3) is much easier to maintain and tweak, and 4) is much faster (due to other performance improvements not related to the rule-based system per se).

Someone of the internationalization team in Google could verify if the Golden set is up to date. But I doubt support for -x has made its inroads, as the world is moving away from this as it was too problematic.

I would not, however, modify the algorithm based on isolated use cases. There are very specific reasons why it is the way it is with years of tweaking and testing based data-driven design.

mpvl commented 2 years ago

As another note. The OP mentions the desire to support art-x-klingon, but this was never a valid language tag for Klingon. The (now deprecated) i-klingon predates the introduction of art. In turn, art was introduced on the same date as tlh (the modern canonical form of Klingon). The tag art only serves as a fallback for artificial languages, so art-x-klingon was never a valid tag, and a matcher relying on matching private use tags would miss this.

This already shows the problem of the use of private use tags: their usage is not stable and changes over time. This is one of the reasons why there was a move away from supporting private use tags.

Note that many artificial languages have their own language code. Here is a list: https://en.wikipedia.org/wiki/Codes_for_constructed_languages

I can imagine supporting private use tags, though, if it does not interact with the rest of tag matching and if thorough consideration of the implications of canonicalization are given. But, as the use of -x is largely deprecated, the impact of considering them for matching regular languages will have to be tested against a large body of user data. My suspicion is that including this information in matches generally will make things worse, not better. This means that supporting matching on private use tags, should probably be limited to specifically supporting art-x-X and maybe tags starting with x-, if at all.

bep commented 2 years ago

The official tag for Klingon is tlh and for Elvish, if Iโ€™m not mistaken,

Those were just two random examples. My original problem is that of handling of "unknown language tags", but you can also easily make the argument for unofficial artificial languages (like Foozzy, a new language i just made up).

ameowlia commented 2 years ago

๐Ÿ˜ฎ Wow, thank you so much for all of this information @mpvl . It has been very hard to discover all of this. (Or maybe there are some docs I am missing?)

What I am taking from this is:

๐Ÿ™ Thanks again!