python-babel / babel

The official repository for Babel, the Python Internationalization Library
http://babel.pocoo.org/
BSD 3-Clause "New" or "Revised" License

Russian has incorrect plural rules #711

Open hcubism opened 4 years ago

hcubism commented 4 years ago

My team ran into this warning while attempting to parse our PO files: WARNING: msg has more translations than num_plurals of catalog

Upon further investigation, I traced the issue to Babel's plural ruleset for Russian being incorrect.

The current ruleset has 3 plural forms with the following plural expression:

n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2

However, according to the Unicode CLDR, the ruleset has 4 plural forms, and the expression should be:

n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<12 || n%100>14) ? 1 : n%10==0 || (n%10>=5 && n%10<=9) || (n%100>=11 && n%100 <=14) ? 2 : 3
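A quick way to compare the two expressions is to transcribe them into Python and run them over a range of integers. The sketch below is my own transcription, not Babel code, and the function names are made up for illustration. It suggests that for non-negative integers both expressions select the same form and that index 3 is never reached, so the practical difference is the declared number of forms (3 versus 4), which appears to be what the warning is about.

```python
def old_rule(n: int) -> int:
    """Babel's current 3-form expression for Russian, transcribed to Python."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


def proposed_rule(n: int) -> int:
    """The proposed 4-form expression, transcribed to Python."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 12 or n % 100 > 14):
        return 1
    if n % 10 == 0 or 5 <= n % 10 <= 9 or 11 <= n % 100 <= 14:
        return 2
    return 3


# Both expressions depend only on n % 100, so 0..99 already covers every case;
# the wider range is just a cheap safety margin.
mismatches = [n for n in range(1000) if old_rule(n) != proposed_rule(n)]
indices_used = sorted({proposed_rule(n) for n in range(1000)})

print(mismatches)    # []
print(indices_used)  # [0, 1, 2]
```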

akx commented 4 years ago

Hi there!

It looks like the plural rule collection in Babel's PO handling has been left languishing for a while (most of those rules are from 2007).

In a perfect world, we'd regenerate the gettext rules from the CLDR specification and the plural data we already ingest, but it looks like our to_gettext() plural rule function is . . . suboptimal. For ru, it currently generates a monstrosity like

((((0 == 0)) && (((n % 10) >= 2 && (n % 10) <= 4))) && (!(((n % 100) >= 12 && (n % 100) <= 14)))) ? 1 : (((((0 == 0)) && (((n % 10) == 0))) || (((0 == 0)) && (((n % 10) >= 5 && (n % 10) <= 9)))) || (((0 == 0)) && (((n % 100) >= 11 && (n % 100) <= 14)))) ? 2 : ((((0 == 0)) && (((n % 10) == 1))) && (!(((n % 100) == 11)))) ? 0 : 3

and I wouldn't quite like to use that. (Anyone up for some expression optimization?)
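As a starting point for that optimization, here is a rough sketch that transcribes the generated expression into Python by hand and checks it against a version with the always-true `(0 == 0)` terms dropped. Both functions and their names are illustrative only and are not part of Babel; the check covers non-negative integers only.

```python
def generated(n: int) -> int:
    """Hand transcription of the to_gettext() output quoted above.

    The `(0 == 0)` terms are kept as written; they are always true.
    """
    if (0 == 0) and 2 <= n % 10 <= 4 and not (12 <= n % 100 <= 14):
        return 1
    if (
        ((0 == 0) and n % 10 == 0)
        or ((0 == 0) and 5 <= n % 10 <= 9)
        or ((0 == 0) and 11 <= n % 100 <= 14)
    ):
        return 2
    if (0 == 0) and n % 10 == 1 and not (n % 100 == 11):
        return 0
    return 3


def simplified(n: int) -> int:
    """Same branches in the same order, with the constant terms removed."""
    last_digit, last_two = n % 10, n % 100
    if 2 <= last_digit <= 4 and not (12 <= last_two <= 14):
        return 1
    if last_digit in (0, 5, 6, 7, 8, 9) or 11 <= last_two <= 14:
        return 2
    if last_digit == 1 and last_two != 11:
        return 0
    return 3


# The result depends only on n % 100, so this range is exhaustive for integers.
all_equal = all(generated(n) == simplified(n) for n in range(1000))
print(all_equal)  # True
```

Dropping constant sub-expressions like this is the kind of pass a smarter gettext generator could apply before emitting the string.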

So, um, @hcubism, where'd you get that corrected expression? I can't find it in that syntax on the CLDR page linked.

hcubism commented 4 years ago

Hello!

I wrote the expression based on the criteria in the Rules column on that CLDR page. However, I didn't take into account the v != 0 case, so yeah it's still not 100% correct, but for my team's purpose it works well enough.

In the meantime, I ended up patching messages/plurals.py by replacing the Russian line in PLURALS like so:

# OLD
# 'ru': (3, '(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<10 || n%100>=20) ? 1 : 2)'),
# NEW
'ru': (4, '(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 && (n%100<12 || n%100>14) ? 1 : n%10==0 || (n%10>=5 && n%10<=9) || (n%100>=11 && n%100 <=14) ? 2 : 3)'),
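If editing the installed file is not an option, the same override could probably be applied at runtime instead. The sketch below uses a stand-in dictionary so it runs without Babel; `override_plural_rule` and `fake_plurals` are hypothetical names, and the assumption that Babel reads `PLURALS` lazily on each lookup is untested here.

```python
RU_RULE = (
    4,
    "(n%10==1 && n%100!=11 ? 0 : "
    "n%10>=2 && n%10<=4 && (n%100<12 || n%100>14) ? 1 : "
    "n%10==0 || (n%10>=5 && n%10<=9) || (n%100>=11 && n%100 <=14) ? 2 : 3)",
)


def override_plural_rule(plurals: dict, locale: str, rule: tuple):
    """Replace one entry in a PLURALS-style dict and return the previous value."""
    previous = plurals.get(locale)
    plurals[locale] = rule
    return previous


# Stand-in for the PLURALS table in messages/plurals.py (assumption: same shape,
# locale code -> (number of forms, plural expression)).
fake_plurals = {
    "ru": (
        3,
        "(n%10==1 && n%100!=11 ? 0 : n%10>=2 && n%10<=4 "
        "&& (n%100<10 || n%100>=20) ? 1 : 2)",
    ),
}

old = override_plural_rule(fake_plurals, "ru", RU_RULE)
print(old[0], "->", fake_plurals["ru"][0])  # 3 -> 4
```

With Babel installed, the real table would be passed in place of `fake_plurals` before any catalog is read.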

ri-gilfanov commented 3 years ago

Hi @hcubism and @akx. I'm not sure you need the fourth form in the rule: the noun form used after fractional numbers is identical to the form used after the integers 2, 3, 4, 22, 23, 24, 32, etc.

As for how monstrous the generated rule is, it seems to me that you can experiment with the order of the conditions.

A Python implementation, for example:

def print_form(n, forms):
    # Whole numbers ending in 0 or 5-9, and those ending in 11-19, take forms[2].
    # `n % 1 == 0` limits the second test to whole numbers.
    if n % 10 == 0 or (n % 1 == 0 and (5 <= n % 10 <= 9 or 11 <= n % 100 <= 19)):
        print(n, forms[2])
    elif n % 10 == 1:
        print(n, forms[0])
    else:
        # Endings 2-4 and fractional numbers take forms[1].
        print(n, forms[1])

forms = ['кот', 'кота', 'котов']
for n in (0, 1, 2, 4, 5, 19, 20, 21, 0.5, 1.5, 13.666):
    print_form(n, forms)

Not sure, but try testing this rule:

'ru': (3, '(n%10==0 || (n%1==0 && ((n%10>=5 && n%10<=9) || (n%100>=11 && n%100<=19))) ? 2 : n%10 == 1 ? 0 : 1)'),
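One way to test it is to compare it against the 3-form expression currently in Babel. The sketch below is a hand transcription of both into Python with made-up function names, and it assumes the whole-number test in the suggested rule is written as `n % 1 == 0`. On that assumption, the two agree for every non-negative integer.

```python
def suggested_rule(n: int) -> int:
    """Suggested 3-form rule, transcribed to Python (integer test: n % 1 == 0)."""
    if n % 10 == 0 or (n % 1 == 0 and (5 <= n % 10 <= 9 or 11 <= n % 100 <= 19)):
        return 2
    if n % 10 == 1:
        return 0
    return 1


def current_rule(n: int) -> int:
    """Babel's current 3-form rule for Russian, transcribed to Python."""
    if n % 10 == 1 and n % 100 != 11:
        return 0
    if 2 <= n % 10 <= 4 and (n % 100 < 10 or n % 100 >= 20):
        return 1
    return 2


forms = ['кот', 'кота', 'котов']

# Spot checks against the expected word forms.
expected = {1: 'кот', 2: 'кота', 5: 'котов', 11: 'котов',
            12: 'котов', 21: 'кот', 22: 'кота', 111: 'котов'}
spot_ok = all(forms[suggested_rule(n)] == word for n, word in expected.items())

# Both rules depend only on n % 100, so this range is exhaustive for integers.
same_as_current = all(suggested_rule(n) == current_rule(n) for n in range(1000))

print(spot_ok, same_as_current)  # True True
```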

By the way, the current rules for Russian, Ukrainian, Belarusian, Croatian and Bosnian are the same. If you change the rule for one of these languages, it is probably worth changing it for the others as well.

Best regards