php-gettext / Languages

gettext language list automatically generated from CLDR data
https://php-gettext.github.io/Languages/

How are the "Plural-Form" indices determined? #45

Closed JanisE closed 2 years ago

JanisE commented 2 years ago

Hello!

What determines the "Plural-Form" formula in PO translation files?

I see that the actual way is to use the same order the plural forms appear in this JSON file https://github.com/php-gettext/Languages/blob/master/src/cldr-data/supplemental/plurals.json , which, I suspect, comes from https://github.com/unicode-org/cldr/blob/main/common/supplemental/plurals.xml

Is that how it should be, though? Does the CLDR data serve not just as the source of grammar rules for different languages, but also determine the order of "Plural-Form" indices in the PO file format?

My inquiry is based on the problems of different plural form formulas existing for the same language (in my case – the Latvian language). More specifically, formulas with different order of the plural forms.

If we use the numbers "1", "2", "0" to denote the three existing plural forms in Latvian (the first one contains "1", the second one contains "2", the third one contains "0"), then the two different formulas being used are:

"1" => 0, "2" => 1, "0" => 2

"1" => 1, "2" => 2, "0" => 0
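To make the difference concrete, here is a minimal Python sketch of the two index orders. The category conditions are my reading of the CLDR integer rules for Latvian and should be treated as an assumption; the names `latvian_category`, `OLD_ORDER` and `NEW_ORDER` are hypothetical and do not come from any tool mentioned here.

```python
def latvian_category(n: int) -> str:
    """Plural category for a non-negative integer (assumed CLDR-style rules)."""
    if n % 10 == 0 or 11 <= n % 100 <= 19:
        return "zero"
    if n % 10 == 1 and n % 100 != 11:
        return "one"
    return "other"

# The two index orders discussed above (same grammar, different numbering).
OLD_ORDER = {"one": 0, "other": 1, "zero": 2}
NEW_ORDER = {"zero": 0, "one": 1, "other": 2}

def plural_index(n: int, order: dict) -> int:
    """Index into msgstr[] for n under the given category order."""
    return order[latvian_category(n)]

# Entries written by a translator who followed the old order.
msgstr_old = ["<form for 1>", "<form for 2>", "<form for 0>"]

for n in (1, 2, 0):
    matching = msgstr_old[plural_index(n, OLD_ORDER)]
    mismatched = msgstr_old[plural_index(n, NEW_ORDER)]
    print(n, matching, mismatched)
```

Reading old-order entries with the new-order indices selects the wrong string for every one of the three sample numbers, which is the kind of silent breakage described in this issue.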

Poedit switched between the two formulas on 2018-05-11 (my issue about that).

The previously used format matched English better: singular = 0, plural = 1, third case (zero) = 2. And it would be fine if the CLDR had put the zero case last, but they didn't (for whatever reason). For Latvian, it is now zero = 0, singular = 1, plural = 2.

This differs from English and from a lot of translation files already made with the old formula, which is still correct from the grammar point of view. But the automatic CLDR data importing tools introduced a different order, which created problems because various PO file editors and readers/users didn't expect different formulas (different form index orders) for the same language.

mlocati commented 2 years ago

The order of plural forms reflects the order of the Unicode CLDR names: zero, one, two, few, many, other.

When a language doesn't define one or more of these cases, we simply take those cases out of the indexes, keeping the order of the remaining cases.

For example, in English we only have the one and other rules, so one gets index 0 and other gets index 1.
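The rule described above can be sketched in a few lines of Python. This is only an illustration, not the library's code: it assumes the CLDR category names are ordered zero, one, two, few, many, other, and `plural_indices` is a hypothetical helper name.

```python
# Assumed fixed order of the CLDR plural category names.
CLDR_ORDER = ("zero", "one", "two", "few", "many", "other")

def plural_indices(categories):
    """Number the categories a language defines, keeping the CLDR order."""
    present = [c for c in CLDR_ORDER if c in categories]
    return {c: i for i, c in enumerate(present)}

print(plural_indices({"one", "other"}))          # English
print(plural_indices({"zero", "one", "other"}))  # Latvian
```

Under this rule, any language that defines a zero category always gets it at index 0, which is how Latvian ends up with zero = 0, one = 1, other = 2.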

JanisE commented 2 years ago

It could be that the CLDR person who wrote that list didn't give any special thought to its order, though. And if they change the order (but keep the string keys, so who can blame them?), would you reorder the indexes in the plural form formula?

I tried your tool going back to the 2015 version, when it was built, and, yes, it does output the indexes in that order, so it's consistent all right. :) But PO files were created and used, and "Plural-Forms" formulas were created (manually), before that too, so who defined the order then? What was the original consideration when creating those formulas? I don't think that CLDR used the same considerations, or even had PO files in mind at all.

As for the CLDR's "naming" of the plural forms.

"zero" is used only for 8 languages, and in the other 7 it actually consists just of the 0. So, is the naming even appropriate in the Latvian case?

Lithuanian is a similar language to Latvian. The Lithuanian counterpart of the form called "zero" in Latvian is called "other", and the counterpart of what is called "other" in Latvian is called "few". So I don't think that the naming was done rigidly and unambiguously enough to use it as the basis for determining PO plural form indices.

I guess that before auto-updating from CLDR, the more English-fitting version was mostly used by PO editors: 0 for singular, 1 for plural, 2 for special cases (it's also mentioned in Latvian translation guidelines used by many translators before). And then CLDR named the special cases "zero" and put it as the first in their list (just because zero seemed to fit the first position best because it's the smallest of them all, right?), and plural form generators and, in turn, systems that use them started generating the other formula: 0 for special cases, 1 for singular, 2 for plural.

And it happened rather silently, so it created incorrect translations because the translation entries did not match the PO header and the plural forms formula. This problem is still in effect in some (maybe most, I don't know) WordPress translation projects, and "Loco Translate" does not support this new formula either.

mlocati commented 2 years ago

AFAIK the plural rule order is not standardized, and changing it within this library would be a breaking change. BTW you can build your own order by using the categories property of the language instance.
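As a language-neutral illustration of building a custom order (this is not the library's PHP API), the Python sketch below assembles a Plural-Forms value from per-category conditions and a caller-chosen order. The helper `build_plural_forms` and the Latvian condition strings are assumptions made for the example.

```python
def build_plural_forms(conditions, fallback, order):
    """Build a Plural-Forms value with indices assigned by `order`.

    conditions: category -> C condition, for every category except `fallback`
    fallback:   the catch-all category (normally "other")
    order:      all categories, in the index order the caller wants
    """
    index = {cat: i for i, cat in enumerate(order)}
    expr = str(index[fallback])
    # Wrap from the last condition to the first, so the first is tested first.
    for cat, cond in reversed(list(conditions.items())):
        expr = f"({cond}) ? {index[cat]} : {expr}"
    return f"nplurals={len(order)}; plural={expr};"

# Assumed integer conditions for Latvian, written as C expressions.
latvian = {
    "zero": "n % 10 == 0 || (n % 100 >= 11 && n % 100 <= 19)",
    "one": "n % 10 == 1 && n % 100 != 11",
}

print(build_plural_forms(latvian, "other", ["one", "other", "zero"]))
print(build_plural_forms(latvian, "other", ["zero", "one", "other"]))
```

Because the categories are mutually exclusive, the order in which conditions are tested does not matter; only the index attached to each category changes between the two outputs.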

JanisE commented 2 years ago

Although it should not be a breaking change if both the old and the new formula are correct with regard to the particular language's grammar (which they are). The PO file should include the formula, whichever it is, that its translation entries follow, and the tools using the PO file should theoretically take it into consideration (which they do not).
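A rough Python sketch of what "taking the header into consideration" could look like: the reader evaluates whatever formula the file declares instead of assuming one per language. It relies on `gettext.c2py`, a helper in Python's standard `gettext` module that, as far as I know, is not part of the documented API; the two header strings and the placeholder entries are assumptions for the example.

```python
import gettext
import re

def plural_function(plural_forms: str):
    """Turn a Plural-Forms header value into a Python function of n."""
    match = re.search(r"\bplural\s*=\s*([^;]+)", plural_forms)
    if match is None:
        raise ValueError("no plural expression found in header")
    return gettext.c2py(match.group(1))

# Two hypothetical files: same grammar rules, different index order.
old_header = ("nplurals=3; plural=(n%10==1 && n%100!=11) ? 0 : "
              "(n%10==0 || (n%100>=11 && n%100<=19)) ? 2 : 1;")
new_header = ("nplurals=3; plural=(n%10==0 || (n%100>=11 && n%100<=19)) ? 0 : "
              "(n%10==1 && n%100!=11) ? 1 : 2;")
old_entries = ["<one>", "<other>", "<zero>"]
new_entries = ["<zero>", "<one>", "<other>"]

old_index = plural_function(old_header)
new_index = plural_function(new_header)

for n in (0, 1, 2, 11, 21):
    # Each file is read with its own formula, so both give the same form.
    print(n, old_entries[old_index(n)], new_entries[new_index(n)])
```

When each file is read with the formula from its own header, both index orders select the same form for every number; the mismatch only appears when a tool substitutes its own idea of the formula.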

(If you consider it a breaking change, then you're passing the responsibility for not making a breaking change to Unicode CLDR, who may very well not be paying attention to the order of the plural forms in their data at all (it just happens to be in that order and happens not to get changed), and so may not even know they have this responsibility.)

But yeah, apparently this is not solvable at this level anymore, what's done is done.