unicode-org / inflection

code, data and documentation related to handling inflection problems

Support inflection of nouns #19

Open grhoten opened 4 months ago

grhoten commented 4 months ago

We should be able to inflect common nouns and proper nouns. This would typically include being able to modify the grammatical gender, grammatical number and grammatical cases in a lot of languages.

What English expresses with prepositions is expressed through grammatical case in many other languages, typically as suffixes on nouns. That makes this related to issue https://github.com/unicode-org/inflection/issues/17.

English possessive/genitive forms of nouns typically need to add 's or just ', but that algorithmic logic is a lot harder in other languages, like German, Danish, Dutch, Russian and so forth.

For example, you should be able to turn "city" into "city's" or turn "cities" into "cities'". For a language like Russian, you can look at кот for an example.
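The English case can be sketched in a few lines. This is only an illustration, not code from this repository: the `possessive` function and its `plural` flag are hypothetical, and it assumes the caller already knows the grammatical number of the noun.

```python
def possessive(noun: str, plural: bool = False) -> str:
    """Return the English possessive form of a noun.

    Assumption: plural nouns already ending in "s" take a bare apostrophe
    and everything else takes "'s". Style guides differ on singular names
    ending in "s", which this sketch does not model.
    """
    if plural and noun.endswith("s"):
        return noun + "'"
    return noun + "'s"


print(possessive("city"))                    # city's
print(possessive("cities", plural=True))     # cities'
print(possessive("children", plural=True))   # children's
```

The rule needs the grammatical number as input because the spelling alone is not enough: "boss" and "cities" both end in "s" but take different endings.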

Here's a more compact declension table for looking at such information.

| cat (кот) | singular | plural |
|---|---|---|
| nominative | кот | коты |
| genitive | кота | котов |
| dative | коту | котам |
| accusative | кота | котов |
| instrumental | котом | котами |
| prepositional | коте | котах |
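For words that are known in advance, the forms can simply be stored and looked up. Below is a minimal sketch of that idea; the `LEXICON` layout and the `lookup` helper are hypothetical names chosen for illustration. Note that кот is an animate masculine noun, so its accusative forms coincide with the genitive ones.

```python
from typing import Optional

# Hypothetical lexicon layout: lemma -> (case, number) -> surface form.
LEXICON = {
    "кот": {
        ("nominative", "singular"): "кот",
        ("nominative", "plural"): "коты",
        ("genitive", "singular"): "кота",
        ("genitive", "plural"): "котов",
        ("dative", "singular"): "коту",
        ("dative", "plural"): "котам",
        ("accusative", "singular"): "кота",   # animate: same as genitive
        ("accusative", "plural"): "котов",    # animate: same as genitive
        ("instrumental", "singular"): "котом",
        ("instrumental", "plural"): "котами",
        ("prepositional", "singular"): "коте",
        ("prepositional", "plural"): "котах",
    },
}


def lookup(lemma: str, case: str, number: str) -> Optional[str]:
    """Return the stored form, or None if the lemma or form is unknown."""
    return LEXICON.get(lemma, {}).get((case, number))


print(lookup("кот", "dative", "plural"))           # котам
print(lookup("собака", "nominative", "singular"))  # None (not in the lexicon)
```

Returning `None` for unknown words leaves room for an algorithmic fallback to handle out-of-vocabulary nouns.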

nciric commented 4 months ago

+1

I think this was a main use case when we started discussing this project as most of the placeholders in messages are nouns (proper or common).

The solution will probably range from simple/complex algorithms + lexicon exceptions, to potentially ML models for some languages. I feel this is the first problem we should tackle, as it intersects well with common needs.

grhoten commented 4 months ago

I'd say that for single words or very few words, ML is undesirable. From experience, it’s very resource intensive, which rules it out for resource-constrained environments. There are many languages where a traditional algorithmic solution for out-of-vocabulary words is cheaper, faster, smaller, quicker to implement and more accurate than an ML solution. I have some horror stories around this topic.

If you start handling many words or a whole sentence, ML starts looking more appealing because such solutions thrive on context.

I’d say the only exceptions to this rule are agglutinative languages, like Finnish and perhaps Turkish. In general, an ML approach is more likely to be accurate in such languages. That requires a lengthier overview and education session on the topic.

The choice between ML and rule-based approaches will probably require a discussion to find the right balance.

nciric commented 4 months ago

> I’d say the only exceptions to this rule are agglutinative languages, like Finnish and perhaps Turkish. In general, an ML approach is more likely to be accurate in such languages. That requires a lengthier overview and education session on the topic.

I expect most languages will be fine with the algorithmic + lexicon approach (and we should focus on those first). I would use ML only when necessary, as you mentioned for Finnish/Turkish. So this is not a decision we need to make ahead of time, just a reminder that we need to organize our code to allow different implementations.
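One way to keep that door open is a small common interface that lexicon, rule-based and (later) ML backends all implement. The sketch below is purely illustrative and not a proposal for the actual API: `Inflector`, `LexiconInflector`, `EnglishPluralRule` and `ChainInflector` are hypothetical names, and the English plural rule is deliberately simplistic.

```python
from typing import Optional, Protocol, Sequence

PLURAL = frozenset({"plural"})


class Inflector(Protocol):
    """Hypothetical backend interface; an ML model could implement it too."""

    def inflect(self, lemma: str, features: frozenset) -> Optional[str]: ...


class LexiconInflector:
    """Looks up exceptions and known words in a stored table."""

    def __init__(self, entries: dict):
        self._entries = entries

    def inflect(self, lemma: str, features: frozenset) -> Optional[str]:
        return self._entries.get((lemma, features))


class EnglishPluralRule:
    """Toy algorithmic fallback for out-of-vocabulary English nouns."""

    def inflect(self, lemma: str, features: frozenset) -> Optional[str]:
        if features != PLURAL:
            return None  # this backend only knows how to pluralize
        if lemma.endswith(("s", "x", "z", "ch", "sh")):
            return lemma + "es"
        if len(lemma) > 1 and lemma.endswith("y") and lemma[-2] not in "aeiou":
            return lemma[:-1] + "ies"
        return lemma + "s"


class ChainInflector:
    """Tries each backend in order and returns the first answer."""

    def __init__(self, backends: Sequence[Inflector]):
        self._backends = list(backends)

    def inflect(self, lemma: str, features: frozenset) -> Optional[str]:
        for backend in self._backends:
            result = backend.inflect(lemma, features)
            if result is not None:
                return result
        return None


chain = ChainInflector([
    LexiconInflector({("child", PLURAL): "children"}),
    EnglishPluralRule(),
])
print(chain.inflect("child", PLURAL))  # children (lexicon exception)
print(chain.inflect("city", PLURAL))   # cities (rule)
print(chain.inflect("cat", PLURAL))    # cats (rule)
```

With this shape, swapping in an ML backend for a language like Finnish would mean adding one more class to the chain rather than changing callers.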