Gender normalization (localization)

skalee commented 3 years ago

@ronaldtse I got a couple of questions. See IEV 102-04-22 on old Electropedia.

The Serbian entry (апсциса, <дуж криве> ж јд) has the gender ж јд, which probably means "feminine singular". We surely need to display it this way, but how should we represent it in data: as ж јд, or normalized to f?
The Dutch entry (abscis, m/f) has the gender m/f, which means "masculine with optional feminine" (this is different from masculine, feminine, or neuter). We surely need to display it this way, but how should we represent it in data: as m/f, or is there another common notation for mixed masculine-feminine genders like this one?

Also note that there may be more genders or similar categories. For example, in some Slavic languages (Czech, Slovene) nouns are further divided into animate and inanimate ones. I am not sure how important that is, but Wiktionary denotes it next to the gender (see this).
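To make the "display as-is vs. normalize" question concrete, here is a minimal sketch of one option: keep the source label verbatim for display and store a normalized reading next to it. All type names, the `LABELS` table, and the three-letter language codes are assumptions made for illustration, not part of the Glossarist model; the reading of ж јд is only the guess stated above.

```typescript
// Hypothetical types for illustration; not taken from the Glossarist model.
type Gender = "masculine" | "feminine" | "neuter" | "masculine-or-feminine";
type GrammaticalNumber = "singular" | "plural";

interface NormalizedGrammar {
  gender?: Gender;
  number?: GrammaticalNumber;
  sourceLabel: string; // original label, kept verbatim for display
}

type Reading = Omit<NormalizedGrammar, "sourceLabel">;

// Per-language lookup table (assumed). "ж јд" is read as "feminine singular",
// which the thread itself only calls probable.
const LABELS: Record<string, Record<string, Reading>> = {
  srp: { "ж јд": { gender: "feminine", number: "singular" } },
  nld: {
    m: { gender: "masculine" },
    f: { gender: "feminine" },
    n: { gender: "neuter" },
    "m/f": { gender: "masculine-or-feminine" },
  },
};

// Returns the normalized reading if the label is known; unknown labels
// pass through with only sourceLabel set, so nothing is lost.
function normalizeLabel(lang: string, label: string): NormalizedGrammar {
  const known: Reading | undefined = LABELS[lang]?.[label.trim()];
  return { ...known, sourceLabel: label };
}
```

With this shape the interface can always show `sourceLabel`, while search or filtering can rely on the normalized fields when they are present.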

skalee commented 3 years ago

Furthermore, some languages may have a different set of genders for singular and plural. One example is Polish, in which most linguists distinguish 5 genders: 3 for singular (masculine, feminine and neuter) and 2 for plural (virile and non-virile). It must be noted, though, that some linguists prefer different classifications; for example, Polish entries in IEV use an old-school approach with masculine and feminine genders in plural (102-03-13).

My conclusion is that it will be difficult to develop a discrete set of genders which will work for every language and for every project. Perhaps we should allow arbitrary genders, but I'm not sure if Glossarist Desktop supports that. Perhaps we should be even more elastic and describe terms with an array of arbitrary grammar classifiers rather than have separate fields for gender, plurality, etc.
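The "array of arbitrary grammar classifiers" idea could look roughly like the sketch below. The interface and field names are hypothetical, chosen only to illustrate the shape; nothing here reflects what Glossarist Desktop actually supports.

```typescript
// Hypothetical shape: each classifier is a free-form category/value pair,
// so no fixed set of genders or numbers is baked into the model.
interface GrammarClassifier {
  category: string; // e.g. "gender", "number", "animacy"
  value: string;    // e.g. "feminine", "singular", "virile"
}

interface Designation {
  designation: string;
  classifiers: GrammarClassifier[];
}

// Collects all values recorded for one category on a designation.
function classifierValues(d: Designation, category: string): string[] {
  return d.classifiers
    .filter((c) => c.category === category)
    .map((c) => c.value);
}

// The two IEV entries from this issue, expressed with classifiers.
const serbian: Designation = {
  designation: "апсциса",
  classifiers: [
    { category: "gender", value: "feminine" },
    { category: "number", value: "singular" },
  ],
};

const dutch: Designation = {
  designation: "abscis",
  classifiers: [{ category: "gender", value: "masculine/feminine" }],
};
```

The trade-off is that free-form strings accept anything, including typos, so some validation layer would probably still be needed per project.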

strogonoff commented 3 years ago

For what it’s worth, here is how grammatical properties of nouns are typed in the Glossarist model:

https://github.com/glossarist/glossarist-desktop/blob/4105c7a2b2b1f5085c748af3ce0fdb27fd7e3149/src/models/concepts.ts#L188

Common and neuter genders are supported.
Grammatical number (plural/singular) and gender are separate.
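For readers who do not want to open the link, the two points above can be pictured with a sketch like the following. It is not copied from `concepts.ts`; the names `Gender`, `GrammaticalNumber`, and `NounProperties` are assumptions, and only the value sets (four genders, singular/plural) come from this thread.

```typescript
// Illustrative only; not copied from the linked concepts.ts.
type Gender = "masculine" | "feminine" | "common" | "neuter";
type GrammaticalNumber = "singular" | "plural";

// Gender and number are independent, optional properties.
interface NounProperties {
  gender?: Gender;
  grammaticalNumber?: GrammaticalNumber;
}

const GENDERS: readonly Gender[] = ["masculine", "feminine", "common", "neuter"];

// Type guard for values arriving as plain strings (e.g. from YAML or JSON).
function isGender(value: string): value is Gender {
  return GENDERS.indexOf(value as Gender) !== -1;
}

const example: NounProperties = { gender: "common", grammaticalNumber: "singular" };
```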

Not sure if this helps and what you are trying to achieve, just saw this issue in my notifications.

skalee commented 3 years ago

What does "common" gender stand for? Is it kinda "not applicable" or "unspecified"? Or maybe it's kinda "masculine or feminine, but not neuter"?

Not sure what you are trying to achieve.

I'm trying to achieve something more flexible, as there are languages which have more than three genders. For example, in the context of IEV, Dutch has m, f, n, and m/f.

strogonoff commented 3 years ago

I recommend using fully qualified gender names instead of one-letter abbreviations to reduce ambiguity.

For linguistic background of neuter/common see e.g. https://en.wikipedia.org/wiki/Grammatical_gender

skalee commented 3 years ago

For linguistic background of neuter/common see e.g. https://en.wikipedia.org/wiki/Grammatical_gender

Thanks! It explains everything.

I recommend using fully qualified gender names instead of one-letter abbreviations to reduce ambiguity.

I'm okay with either option.

Still, I'm not sure if a set of just four genders will be future-proof. For example, some languages distinguish animate and inanimate nouns, and most vocabularies display that next to gender, because it's useful for users. Moreover, some languages (e.g. Polish) distinguish different genders in singular (masculine, feminine, neuter) and in plural (virile, non-virile). These two extra genders in plural can be internally represented as masculine and feminine, and that's probably technically correct, but at some point I guess we'll have to do some mapping in the interface in both Geolexica and Glossarist Desktop so that more appropriate wording is used.

That said, what you proposed should be enough in the context of IEV, and I'm okay with that.
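The interface mapping mentioned above might be sketched like this: genders stay stored as plain masculine/feminine, and a per-language override table supplies the wording shown to users. The function name, the override table, and the `pol` language code are hypothetical; neither Geolexica nor Glossarist Desktop is claimed to work this way.

```typescript
type Gender = "masculine" | "feminine" | "common" | "neuter";
type GrammaticalNumber = "singular" | "plural";

// Hypothetical per-language display overrides, keyed by "gender/number".
const DISPLAY_OVERRIDES: Record<string, Record<string, string>> = {
  pol: {
    "masculine/plural": "virile",
    "feminine/plural": "non-virile",
  },
};

// Returns the label to show for a stored gender, falling back to the
// stored value when the language has no override for that combination.
function displayGender(
  lang: string,
  gender: Gender,
  grammaticalNumber?: GrammaticalNumber
): string {
  const key = grammaticalNumber ? `${gender}/${grammaticalNumber}` : gender;
  return DISPLAY_OVERRIDES[lang]?.[key] ?? gender;
}
```

Because the override only affects presentation, the stored data would stay within the four-gender set discussed here.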

strogonoff commented 3 years ago

An animate/inanimate property could be added if needed, but like you say, for the glossaries we deal with it may not be relevant.

Generally, in linguistics there are different competing ways of classifying verbal expressions. Control bodies can disagree with each other about which one to use. Also, these classifications keep evolving.

I think user-configurable versioned schemas (like what we are trying to do with the generic registry schema) are the way to go. Some vocabularies may need more finely detailed grammatical properties, but for others those properties may not matter.
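As a rough illustration of what a user-configurable, versioned schema could mean for grammatical properties, consider the sketch below. The `GrammarSchema` shape, the schema ids, and the version strings are all invented for this example; they do not describe the generic registry schema.

```typescript
// Hypothetical schema shape: each vocabulary declares which grammatical
// properties it uses and which values each property may take.
interface GrammarSchema {
  id: string;
  version: string;
  properties: Record<string, readonly string[]>;
}

// A minimal schema matching what this thread settled on for IEV.
const ievGrammarV1: GrammarSchema = {
  id: "iev-grammar",
  version: "1.0.0",
  properties: {
    gender: ["masculine", "feminine", "common", "neuter"],
    number: ["singular", "plural"],
  },
};

// A later version can add detail without changing the data model.
const ievGrammarV2: GrammarSchema = {
  id: "iev-grammar",
  version: "2.0.0",
  properties: {
    ...ievGrammarV1.properties,
    animacy: ["animate", "inanimate"],
  },
};

// Returns a list of human-readable problems; an empty list means valid.
function validateGrammar(
  schema: GrammarSchema,
  values: Record<string, string>
): string[] {
  const errors: string[] = [];
  for (const [key, value] of Object.entries(values)) {
    const allowed = schema.properties[key];
    if (allowed === undefined) {
      errors.push(`unknown property "${key}"`);
    } else if (allowed.indexOf(value) === -1) {
      errors.push(`"${value}" is not allowed for "${key}"`);
    }
  }
  return errors;
}
```

A vocabulary that does not care about animacy would simply stay on the first schema, while one that does could move to the second.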

skalee commented 3 years ago

Generally, in linguistics there are different competing ways of classifying verbal expressions. Control bodies can disagree with each other about which one to use. Also, these classifications keep evolving.

Indeed, this is my primary concern too. But after your clarifications, what we adopted seems enough for now, at least I haven't found any outstanding case yet. Closing?

strogonoff commented 3 years ago

No objections from my side…

On 10 Jan 2021, at 3:38 PM, Sebastian Skałacki notifications@github.com wrote:

Generally, in linguistics there are different competing ways of classifying verbal expressions. Control bodies can disagree with each other about which one to use. Also, these classifications keep evolving.

Indeed, this is my primary concern too. But after your clarifications, what we adopted seems enough for now, at least I haven't found any outstanding case yet. Closing?


ronaldtse commented 3 years ago

I think user-configurable versioned schemas (like what we are trying to do with the generic registry schema) are the way to go. Some vocabularies may need more finely detailed grammatical properties, but for others those properties may not matter.

Agree. It is difficult to get different control bodies to agree on an identical set of grammatical genders, so leaving it customizable is easiest for now.

skalee commented 3 years ago

BTW, what's "generic registry schema"? I'm certainly not on the same page here.

strogonoff commented 3 years ago

It’s the data schema used by a registry editor GUI currently in development. It doesn’t clash with the concept model described here; they are different things.

glossarist / concept-model

Gender normalization (localization) #20