unicode-org / inflection

code, data and documentation related to handling inflection problems

Flesh out grammatical categories #25

Open macchiati opened 3 months ago

macchiati commented 3 months ago

We have a set of grammatical categories/features in CLDR that are also used in ICU. It would be very useful to flesh these out so that we have a consistent set of identifiers for grammatical categories, along with lists of which categories apply to which languages and in which scopes.

Currently the data for this is limited:

  1. Nouns & noun clauses: gender, case, definiteness, plurals (cardinals), ordinals, plural ranges.
  2. Two scopes: general and units
  3. Limited locales
    1. gender, case, definiteness: (50) Amharic, Arabic, Armenian, Azerbaijani, Bangla, ... Turkish, Ukrainian, Urdu, Uzbek
    2. plurals: (300+) Afrikaans, Akan, Albanian, Amharic, Anii, Arabic, Aragonese, Armenian, Assamese, Asturian, Asu, Azerbaijani, ... Xhosa, Yakut, Yiddish, Yoruba, Zulu

https://www.unicode.org/cldr/charts/45/grammar/index.html

https://www.unicode.org/cldr/charts/45/supplemental/language_plural_rules.html
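To make the goal more concrete, here is a minimal Python sketch of what a "consistent identifiers plus applicability per language and scope" table might look like. Everything in it is an assumption for illustration: the class name `CategoryRegistry`, the identifier spellings in `CATEGORIES` and `SCOPES`, and the sample German entries are hypothetical and are not copied from the CLDR data files.

```python
from dataclasses import dataclass, field

# Hypothetical identifiers, based on the categories and scopes listed above.
CATEGORIES = ("gender", "case", "definiteness", "plural", "ordinal", "plural_range")
SCOPES = ("general", "units")


@dataclass
class CategoryRegistry:
    """Maps (locale, scope, category) to the set of values that apply."""

    _table: dict = field(default_factory=dict)

    def add(self, locale, scope, category, values):
        # Reject identifiers outside the agreed vocabulary so typos surface early.
        if category not in CATEGORIES:
            raise ValueError(f"unknown category: {category}")
        if scope not in SCOPES:
            raise ValueError(f"unknown scope: {scope}")
        self._table[(locale, scope, category)] = frozenset(values)

    def values(self, locale, scope, category):
        # An empty set means the category is not recorded for this locale and scope.
        return self._table.get((locale, scope, category), frozenset())

    def applicable(self, locale, scope):
        # Sorted list of categories recorded for this locale and scope.
        return sorted(c for (l, s, c) in self._table if l == locale and s == scope)


# Illustrative entries only; real values would come from the CLDR data.
registry = CategoryRegistry()
registry.add("de", "general", "gender", ["masculine", "feminine", "neuter"])
registry.add("de", "general", "case", ["nominative", "accusative", "dative", "genitive"])
registry.add("de", "units", "case", ["nominative", "accusative", "dative", "genitive"])
```

With a table like this, `registry.applicable("de", "general")` would list the categories recorded for German in the general scope, and a locale or scope with no entry simply returns an empty result. Keying on the full (locale, scope, category) triple is one possible design; it keeps the scope distinction explicit instead of folding units into the general data.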

nciric commented 3 months ago

I think this would benefit CLDR/ICU by improving data quality and maybe reducing data size. It could also help us decide which categories we want to tackle.