Missing pluralization forms

ruby-gettext / gettext

Gettext gem is a pure Ruby Localization(L10n) library and tool which is modeled after the GNU gettext package.

https://ruby-gettext.github.io/

Missing pluralization forms #85

Closed MichaelHoste closed 3 years ago

MichaelHoste commented 3 years ago

We noticed that some pluralization forms are missing from here: https://github.com/ruby-gettext/gettext/blob/master/lib/gettext/tools/msginit.rb#L322

One of the important missing ones is Arabic.

I have 2 questions:

1. Rules importation

Would you accept a pull request that extends these rules using the official CLDR source, converted to valid GetText rules with this tool?

export-plural-rules prettyjson > file.json returns a file like this, and the idea would be to add it to this project and load the pluralization rules from it in the plural_forms method:

{
    "af": {
        "name": "Afrikaans",
        "formula": "n != 1",
        "plurals": 2,
        "cases": [
            "one",
            "other"
        ],
        "examples": {
            "one": "1",
            "other": "0, 2~16, 100, 1000, 10000, 100000, 1000000, …"
        }
    },
    "ak": {
        "name": "Akan",
        "formula": "n > 1",
        "plurals": 2,
        "cases": [
            "one",
            "other"
        ],
        "examples": {
            "one": "0, 1",
            "other": "2~17, 100, 1000, 10000, 100000, 1000000, …"
        }
    },
 ...
}

We could even simplify it to have a file like this:

{
    "af": {
        "formula": "n != 1",
        "plurals": 2
    },
    "ak": {
        "formula": "n > 1",
        "plurals": 2
    },
 ...
}

or even transform it into a Ruby hash. What would be your preference?
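
A minimal sketch of the Ruby hash option, assuming the simplified data above is stored as a constant. PLURAL_RULES and plural_forms_header are hypothetical names used only for illustration; this is not the gem's actual plural_forms method.

```ruby
# Hypothetical rule table, limited to the two locales shown above.
PLURAL_RULES = {
  "af" => { formula: "n != 1", plurals: 2 },
  "ak" => { formula: "n > 1",  plurals: 2 },
}.freeze

# Build the value of a PO file's Plural-Forms header for a locale.
# Returns nil when the locale is not in the table.
def plural_forms_header(locale)
  rule = PLURAL_RULES[locale]
  return nil unless rule

  "nplurals=#{rule[:plurals]}; plural=#{rule[:formula]};"
end

puts plural_forms_header("af")  # prints "nplurals=2; plural=n != 1;"
```

A frozen hash keeps the rules versioned with the code and avoids parsing a JSON file at load time, at the cost of a larger source file.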

2. nplurals > 4

Currently no language has more than 4 plural forms. That will change with Arabic, which needs 6 of them.

Do you expect problems with the rest of the code? Or will improving the plural_forms method be enough to make it work with msgstr[4] and msgstr[5]?

Thank you for any feedback that you may have on this.

I am volunteering to work on this issue with your guidance.

kou commented 3 years ago

Rules importation

Do you know the license of the source XML file?

I'm positive about this approach if its license is compatible with ours (LGPLv3 or later).

We'll implement a reader script with REXML instead of using export-plural-rules. We don't want to depend on PHP.

nplurals > 4

Hmm, I don't know without trying. Let's try.

MichaelHoste commented 3 years ago

> Do you know the license of the source XML file?

The license is here: https://github.com/unicode-org/cldr/blob/master/unicode-license.txt

I'm not knowledgeable enough to know if it's compatible with LGPLv3.

> We'll implement a reader script with REXML instead of using export-plural-rules. We don't want to depend on PHP.

Creating GetText rules from the XML file is not straightforward. I don't know if it's worth the effort to rewrite everything in Ruby. For my purposes, I added a Ruby exporter to the PHP package: https://github.com/michaelhoste/Languages/tree/ruby

export-plural-rules ruby > plural_rules.rb creates this file: https://gist.github.com/MichaelHoste/1ed1647aca8bdce34cefa200b10e6fd9

It should be easy enough to move forward.

I'm quite worried about this pluralization issue: I noticed that some languages now have more plural forms than before. French, for example, gained a "many" plural form for multiples of 1000000 (I speak French and it's quite stupid in my opinion).

For Arabic and these languages with more plural forms, won't the existing GetText plural forms be shifted and the existing msgstr[1] (other) become msgstr[2] after the update? Existing projects will be impacted and they will need to rewrite all their plural forms.
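
To make the concern concrete, here is a small sketch of how the index of "other" could move for French. The two-form rule is the familiar n > 1. The three-form rule is an illustrative assumption, not quoted from this thread, and the method names are hypothetical.

```ruby
# Two-form French rule: index 0 = one, index 1 = other.
def french_old_index(n)
  n > 1 ? 1 : 0
end

# Illustrative three-form rule (assumed for this sketch):
# index 0 = one, index 1 = many (multiples of 1,000,000), index 2 = other.
def french_new_index(n)
  (n == 0 || n == 1) ? 0 : ((n != 0 && n % 1_000_000 == 0) ? 1 : 2)
end

[1, 5, 1_000_000].each do |n|
  puts "n=#{n}: old msgstr[#{french_old_index(n)}], new msgstr[#{french_new_index(n)}]"
end
```

With these two rules, n = 5 selects msgstr[1] under the old rule and msgstr[2] under the new one, which is the shift described above.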

It's not only a problem for Ruby but for other languages as well, since it's more of a GetText limitation. Once https://github.com/php-gettext/Languages and https://github.com/eemeli/make-plural are updated to the latest Unicode version, PHP and JS will have the same issue.

I'm not sure how GetText should deal with this situation. Do you have any thoughts on this?

kou commented 3 years ago

I've implemented a CLDR plurals reader in https://github.com/red-data-tools/red-datasets . We can use CLDR plurals information with the following code:

require "datasets"
plurals = Datasets::CLDRPlurals.new
plurals.each do |locale|
  pp locale.rules
end

I may not fully understand your concern, but it may not be a problem. Each .po/.mo file includes plural forms information such as nplurals=2; plural=(n != 1). So existing .po/.mo files can still use the old plural form rules. New .po/.mo files with the latest plural forms information can use the new plural form rules, but they don't affect existing .po/.mo files.
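
Because the rule travels with each file, a reader only needs the header value to know which rule applies. A rough sketch of extracting it, assuming the header value has the usual nplurals=...; plural=...; shape. parse_plural_forms is a hypothetical helper, not the gem's own parser.

```ruby
# Extract nplurals and the plural expression from a Plural-Forms header value.
# Returns nil if the value does not have the expected shape.
def parse_plural_forms(header)
  match = header.match(/nplurals\s*=\s*(\d+)\s*;\s*plural\s*=\s*(.+?)\s*;?\s*\z/)
  return nil unless match

  { nplurals: match[1].to_i, plural: match[2] }
end

p parse_plural_forms("nplurals=2; plural=(n != 1);")
```

An existing file keeps returning its own nplurals and expression here, whatever rules a newer msginit would generate for the same locale.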

MichaelHoste commented 3 years ago

Thank you for implementing the parsing of CLDR plurals!

However I'm not sure the current implementation will be precise enough to create GetText rules.

As you can see here, CLDR specifications are more complex than GetText specifications and some modifications should be applied to adapt one to the other.

Sometimes even the number of plurals is different, like in Czech:

CLDR:

[Screenshot: CLDR plural cases for Czech]

GetText:

With the rule (n == 1) ? 0 : ((n >= 2 && n <= 4) ? 1 : 2)

[Screenshot: GetText plural forms for Czech]

Please note that v in the CLDR rule represents the number of visible fraction digits in n, and should disappear from the GetText rule. As a result, GetText has one fewer plural case than the CLDR representation.
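
As a quick check of the Czech rule quoted above, the expression can be run directly in Ruby, since its ternary syntax matches the C-style syntax GetText uses. GetText evaluates the expression with an integer n, so a CLDR case that only applies when v is non-zero has nothing to select. czech_plural_index is a hypothetical name for this sketch.

```ruby
# GetText rule for Czech, copied from the comment above.
def czech_plural_index(n)
  (n == 1) ? 0 : ((n >= 2 && n <= 4) ? 1 : 2)
end

[0, 1, 3, 5].each do |n|
  puts "n=#{n} -> msgstr[#{czech_plural_index(n)}]"
end
```

Only indexes 0, 1 and 2 can ever be returned, which matches nplurals=3 for Czech.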

I extracted this compact hash of rules from my previous snippet and I'll use it internally. Maybe it could be useful here too: https://gist.github.com/MichaelHoste/83fb089975efbc019e30613c383964cc

> I may not fully understand your concern, but it may not be a problem. Each .po/.mo file includes plural forms information such as nplurals=2; plural=(n != 1). So existing .po/.mo files can still use the old plural form rules. New .po/.mo files with the latest plural forms information can use the new plural form rules, but they don't affect existing .po/.mo files.

You're absolutely right! I didn't think of that: rules are only created once when using MsgInit, and reused with MsgMerge.

MichaelHoste commented 3 years ago

> nplurals > 4
>
> Hmm, I don't know without trying. Let's try.

For information, it seems to work just fine 🙂

I created a local project with Arabic (6 plurals) and this rule:

(n == 0) ? 0 : ((n == 1) ? 1 : ((n == 2) ? 2 : ((n % 100 >= 3 && n % 100 <= 10) ? 3 : ((n % 100 >= 11 && n % 100 <= 99) ? 4 : 5))))

Using n_("%{num} stuff", "%{num} stuffs", num) I have the correct block in the generated PO:

msgid "%{num} stuff"
msgid_plural "%{num} stuffs"
msgstr[0] ""
msgstr[1] ""
msgstr[2] ""
msgstr[3] ""
msgstr[4] ""
msgstr[5] ""

And when I translate all the plurals, it works correctly and selects the appropriate plural depending on num.
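
For reference, a short sketch of which msgstr index that rule selects for a few values of num. The category labels assume the six forms follow the CLDR category names in the order the rule tests them; arabic_plural_index and ARABIC_CATEGORIES are hypothetical names.

```ruby
# Assumed labels for the six indexes, in the order the rule assigns them.
ARABIC_CATEGORIES = %w[zero one two few many other].freeze

# Arabic rule copied from the comment above.
def arabic_plural_index(n)
  (n == 0) ? 0 : ((n == 1) ? 1 : ((n == 2) ? 2 : ((n % 100 >= 3 && n % 100 <= 10) ? 3 : ((n % 100 >= 11 && n % 100 <= 99) ? 4 : 5))))
end

[0, 1, 2, 7, 42, 100].each do |n|
  index = arabic_plural_index(n)
  puts "n=#{n} -> msgstr[#{index}] (#{ARABIC_CATEGORIES[index]})"
end
```

Note that 100 falls through to index 5 because 100 % 100 is 0, which is outside both the 3..10 and 11..99 ranges.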

kou commented 3 years ago

Implemented.

MichaelHoste commented 3 years ago

Thanks a lot for this! It's highly appreciated.

Unfortunately I found a small issue with this implementation.

I discovered an inconsistency for the Arabic language between my own implementation[1] and yours.

Here is the rule you generated:

(n == 1) ? 0 : (n == 0) ? 1 : (n == 2) ? 2 : ((n % 100) >= 3 && (n % 100) <= 10) ? 3 : ((n % 100) >= 11 && (n % 100) <= 99) ? 4 : 5

And mine:

(n == 0) ? 0 : ((n == 1) ? 1 : ((n == 2) ? 2 : ((n % 100  >= 3 &&  n % 100  <= 10) ? 3 : ((n % 100  >= 11 &&  n % 100  <= 99) ? 4 : 5))))

You will notice that n == 0 and n == 1 are inverted, and it seems that the second one is correct:

[Screenshot: Arabic plural forms]

Do you have any idea where this inconsistency could come from?

[1] based on a fork of https://github.com/php-gettext/Languages that generated these rules
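
Both Arabic rules quoted above are valid Ruby as written, so the difference can be checked mechanically. This is only a comparison sketch; the method names are hypothetical.

```ruby
# Rule generated by the new implementation, copied from above.
def generated_arabic_index(n)
  (n == 1) ? 0 : (n == 0) ? 1 : (n == 2) ? 2 : ((n % 100) >= 3 && (n % 100) <= 10) ? 3 : ((n % 100) >= 11 && (n % 100) <= 99) ? 4 : 5
end

# Rule from the php-gettext/Languages fork, copied from above.
def expected_arabic_index(n)
  (n == 0) ? 0 : ((n == 1) ? 1 : ((n == 2) ? 2 : ((n % 100 >= 3 && n % 100 <= 10) ? 3 : ((n % 100 >= 11 && n % 100 <= 99) ? 4 : 5))))
end

# List every n in the range where the two rules pick a different index.
differing = (0..200).reject { |n| generated_arabic_index(n) == expected_arabic_index(n) }
p differing  # prints [0, 1]
```

The two rules agree everywhere except n = 0 and n = 1, where the indexes are swapped.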

I'm sorry but I also have a more global issue with your implementation.

To me, it seems wrong to download and cache the complete list of rules through https://github.com/red-data-tools/red-datasets for these reasons:

Critical external dependency: If the plurals.xml file is corrupted or if it's a work-in-progress on the Master branch, or even if the source repository is moved, this gem will stop working.
Continuous Integration: Unauthenticated requests on raw.githubusercontent.com are limited to 60 requests per hour. If projects depend on gettext, their CI could exceed this limit quite quickly and their tests would fail.
Caching issue: Once the plurals.xml file is cached locally, it will not be updated if the source file is updated. It could create some misunderstanding if two projects that are apparently identical behave differently. The plural file is not even stored in the project but somewhere deep in the system and is not easy to find.
Offline programming: if the file is not cached yet, the gem will not work offline.

My suggestion is that plural rules should be versioned with this gem and be part of the project itself, and not evolve independently. You don't expect the same version of the gem to behave differently over time.

What do you think about it? Does it make sense to you? Maybe Datasets::CLDRPlurals.new could take an optional parameter to load a local file?

MichaelHoste commented 3 years ago

I discovered inconsistencies for other languages too. Maybe it can help isolate the issue:

Language: is
Generated rule: 0
Correct rule: n % 10 != 1 || n % 100 == 11

Languages: gu, hi, zu, kn, fa, am, as, bn
Generated rule: (n == 0) || (n == 1) ? 1 : 0
Correct rule: n > 1 (opposite!)
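
A small sketch confirming that these two rules are exact opposites. In a GetText expression a boolean such as n > 1 evaluates to 0 or 1, so the Ruby version converts it explicitly. The method names are hypothetical.

```ruby
# Generated rule, copied from above. || binds tighter than ?: here,
# so this reads as ((n == 0) || (n == 1)) ? 1 : 0.
def generated_index(n)
  (n == 0) || (n == 1) ? 1 : 0
end

# Correct rule "n > 1", with the boolean mapped to 0 or 1 as GetText does.
def expected_index(n)
  n > 1 ? 1 : 0
end

(0..3).each do |n|
  puts "n=#{n}: generated msgstr[#{generated_index(n)}], expected msgstr[#{expected_index(n)}]"
end
```

For every n the two methods return different indexes, so singular and plural translations would be swapped.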

The other inconsistencies I found are listed below (for each language, the generated rule first, then the correct rule):

br
((n % 10) == 1) && (((n % 100) != 11)) && (((n % 100) != 71)) && (((n % 100) != 91)) ? 0 : ((n % 10) == 2) && (((n % 100) != 12)) && (((n % 100) != 72)) && (((n % 100) != 92)) ? 1 : ((n % 10) >= 3 && (n % 10) <= 4) || ((n % 10) == 9) && (((n % 100) < 10 || (n % 100) > 19)) && (((n % 100) < 70 || (n % 100) > 79)) && (((n % 100) < 90 || (n % 100) > 99)) ? 2 : (n != 0) && ((n % 1000000) == 0) ? 3 : 4
(n % 10 == 1 && n % 100 != 11 && n % 100 != 71 && n % 100 != 91) ? 0 : ((n % 10 == 2 && n % 100 != 12 && n % 100 != 72 && n % 100 != 92) ? 1 : ((((n % 10 == 3 || n % 10 == 4) || n % 10 == 9) && (n % 100 < 10 || n % 100 > 19) && (n % 100 < 70 || n % 100 > 79) && (n % 100 < 90 || n % 100 > 99)) ? 2 : ((n != 0 && n % 1000000 == 0) ? 3 : 4)))

cy
(n == 1) ? 0 : (n == 0) ? 1 : (n == 2) ? 2 : (n == 3) ? 3 : (n == 6) ? 4 : 5
(n == 0) ? 0 : ((n == 1) ? 1 : ((n == 2) ? 2 : ((n == 3) ? 3 : ((n == 6) ? 4 : 5))))

fil
(n == 1) || (n == 2) || (n == 3) || (((n % 10) != 4)) && (((n % 10) != 6)) && (((n % 10) != 9)) ? 1 : 0
n != 1 && n != 2 && n != 3 && (n % 10 == 4 || n % 10 == 6 || n % 10 == 9)

lv
((n % 10) == 1) && ((n % 100) != 11) ? 0 : ((n % 10) == 0) || ((n % 100) >= 11 && (n % 100) <= 19) ? 1 : 2
(n % 10 == 0 || n % 100 >= 11 && n % 100 <= 19) ? 0 : ((n % 10 == 1 && n % 100 != 11) ? 1 : 2)

mk
((n % 10) == 1) && ((n % 100) != 11) ? 1 : 0
n % 10 != 1 || n % 100 == 11

tl
(n == 1) || (n == 2) || (n == 3) || (((n % 10) != 4)) && (((n % 10) != 6)) && (((n % 10) != 9)) ? 1 : 0
n != 1 && n != 2 && n != 3 && (n % 10 == 4 || n % 10 == 6 || n % 10 == 9)

tzm
(n <= 1) || (n >= 11 && n <= 99) ? 1 : 0
n >= 2 && (n < 11 || n > 99)

Maybe some of them are caused by the fact that we don't use the same CLDR version, but the ones I checked are definitely wrong according to the latest rules on https://unicode-org.github.io/cldr-staging/charts/latest/supplemental/language_plural_rules.html