stanfordnlp / stanza

Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages
https://stanfordnlp.github.io/stanza/

Lemmatization of Possessive Case markers in Hindi and Urdu #1067

Closed raydoc closed 2 years ago

raydoc commented 2 years ago

Hi, I was checking your lemmatization for Hindi and Urdu and found that the possessive [genitive] case markers in both languages are wrongly lemmatized.

The Hindi possessive case markers are: का की के

I have noticed that your lemmatiser tends to reduce these [maybe because it is using built-in libraries] to a single form, का:

लड़की के मामा की बहन

Lemmatized form:

लड़की का मामा का बहन

I personally do not agree with this approach, not only because /ka/ is not the base form [lemma] of /ki/ and /ke/ but also because it is "sexist" in nature: it reduces the feminine and the feminine/masculine plural to a masculine singular form. I do not see 'her', 'sa' or 'ihre' in English, French or German reduced to 'him', 'son' or 'ihr', which they should be by the same logic.

The same scenario holds in Urdu: Stanford's Stanza reduces the Urdu genitive case markers /ka/, /ke/, /ki/ to /ka/. Here is the output for an Urdu sentence:

مھمد کی کتاب اور ھسن کے گھر

Lemma:

مھمد کا کتاب اور ھسن کا گھر

I believe the library which does this is at fault. I consulted my colleagues, who are linguists in Hindi and Urdu and also work in the area of NLP, and we feel this approach is linguistically incorrect and, worse still, smacks of sexism. I do not think it is right to reduce a feminine form to a masculine one.

my email: raymond.doctor@gmail.com

I hope a more rational approach to this will be adopted.

AngledLuffa commented 2 years ago

The simple fact is we don't have anyone who speaks Hindi working on this project. The models are trained from the Hindi and Urdu datasets available from Universal Dependencies without any human intervention:

https://universaldependencies.org/

https://github.com/UniversalDependencies/UD_Hindi-HDTB

https://github.com/UniversalDependencies/UD_Urdu-UDTB

I don't even know which of the three possessive case markers are male, female, or neutral. Google Translate doesn't distinguish them. Your message assumes we'll know which ones are which. Hopefully your next message will adopt a more rational approach to explaining what the problem is.

raydoc commented 2 years ago

Hi, I assumed someone knew Hindi/Urdu and hence did not go into details. I'll take Hindi and Urdu one by one, although they both exhibit the same problem.

HINDI

The issue is as follows. Hindi admits 3 case markers: का /ka/, की /kii/, के /ke/. These case markers agree in number and gender with the possessed. If the object/person is masculine singular, का is used; if feminine singular, की; and if plural (masculine or feminine), के.

Examples:

राम का भाई: Ram's brother: ka because brother is masculine
राम की बहन: Ram's sister: kii because sister is feminine
राम के मित्रों: Ram's friends: ke because friends is plural

As you can see, the genitive is marked for the number and gender of the possessed. Reducing them to the masculine singular /ka/ is not the right approach and, as I mentioned, is sexist. This is like lemmatising 'her' to 'his' in English.

URDU

In the case of Urdu, the scenario is similar. I will take the same examples to make comprehension easier. Urdu admits 3 similar case markers: کا /ka/, کی /kii/, کے /ke/. These case markers agree in number and gender with the possessed. If the object/person is masculine singular, کا is used; if feminine singular, کی; and if plural (masculine or feminine), کے.

Examples:

رام کا بھائی: Ram's brother: ka because brother is masculine
رام کی بہن: Ram's sister: kii because sister is feminine
رام کےمتروں: Ram's friends: ke because friends is plural

As you can see, the genitive is marked for the number and gender of the possessed, unlike English: his book, her book. Reducing them to the masculine singular /ka/ is not the right approach. I trust this explanation will help you see the problem in its right perspective. Thank you

Best regards,

Doc
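
The agreement rule described in the comment above can be sketched as a small lookup. This is only an illustration of the three-way rule as the comment states it (masculine singular, feminine singular, plural); the function name `genitive_marker` and the feature labels are hypothetical, chosen to resemble the `Gender=` and `Number=` values in the treebank rows quoted later in the thread. One of those rows tags की with Gender=Fem|Number=Plur, so the real system is likely richer than this sketch.

```python
# Genitive markers as described in the comment above: the marker agrees with
# the POSSESSED noun, not with the possessor. Simplified and illustrative only.
MARKERS = {
    "hi": {"masc_sing": "का", "fem_sing": "की", "plural": "के"},
    "ur": {"masc_sing": "کا", "fem_sing": "کی", "plural": "کے"},
}


def genitive_marker(lang, gender, number):
    """Pick the marker for a possessed noun with the given gender and number."""
    if number == "Plur":
        key = "plural"      # the comment's rule ignores gender in the plural
    elif gender == "Fem":
        key = "fem_sing"
    else:
        key = "masc_sing"
    return MARKERS[lang][key]


# The three Hindi examples from the comment: brother, sister, friends.
examples = [("भाई", "Masc", "Sing"), ("बहन", "Fem", "Sing"), ("मित्रों", "Masc", "Plur")]
for noun, gender, number in examples:
    print("राम", genitive_marker("hi", gender, number), noun)
```

The possessor (राम) never enters the lookup, which is the contrast with English 'his'/'her' that the comment draws.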

AngledLuffa commented 2 years ago

Thank you for the detailed explanation. Now that I know what to look for, I will claim that this is an issue with the underlying data. Especially for languages where we don't have any expertise of our own (which is unfortunately true for Hindi), we simply put the data into our model training, and whatever comes out is what comes out.

The datasets are here:

https://universaldependencies.org/ https://github.com/UniversalDependencies/UD_Hindi-HDTB https://github.com/UniversalDependencies/UD_Urdu-UDTB

So, for example with the Hindi dataset, I can grep for की in the dataset.

(Note: the fields are tab separated, so you can grep for exactly that word by surrounding the character with tabs. You can put a tab in a bash shell with ctrl-V tab. You may already know all of that.)

The results of grepping for की look like:

3       की      का      ADP     PSP     AdpType=Post|Case=Acc|Gender=Fem|Number=Plur    2       case    _       ChunkId=NP|ChunkType=child|Translit=kī
3       की      का      ADP     PSP     AdpType=Post|Case=Nom|Gender=Fem|Number=Sing    2       case    _       ChunkId=NP2|ChunkType=child|Translit=kī

So you can see, the underlying dataset turns it into the male form.
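
For anyone who prefers not to type literal tabs into a shell, the same exact-match search can be sketched in Python by splitting each line on tabs and comparing the FORM column. This is a minimal sketch, not part of Stanza: the function name `rows_with_form` is hypothetical, and the sample rows are copied from this thread rather than read from the treebank.

```python
# The ten tab-separated columns of a CoNLL-U token line.
CONLLU_FIELDS = ("id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc")


def rows_with_form(lines, form):
    """Yield token rows (as dicts) whose FORM column equals `form` exactly."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue                      # skip blank lines and sentence comments
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            continue                      # skip anything that is not a token line
        row = dict(zip(CONLLU_FIELDS, cols))
        if row["form"] == form:
            yield row


# Two rows quoted in this thread, rebuilt with real tab separators.
SAMPLE = [
    "\t".join(["3", "की", "का", "ADP", "PSP",
               "AdpType=Post|Case=Nom|Gender=Fem|Number=Sing",
               "2", "case", "_", "ChunkId=NP2|ChunkType=child|Translit=kī"]),
    "\t".join(["10", "के", "का", "ADP", "PSP",
               "AdpType=Post|Case=Acc|Gender=Masc|Number=Plur",
               "9", "case", "_", "ChunkId=NP4|ChunkType=child|Translit=ke"]),
]

for row in rows_with_form(SAMPLE, "की"):
    print(row["form"], row["lemma"], row["feats"])
```

To run it over a real treebank file, pass an open file handle (opened with `encoding="utf-8"`) in place of `SAMPLE`.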

For के, the results are less consistent. I'll leave a bit more context in case you can explain why it is doing things differently. Sometimes it keeps it the same, and sometimes it switches it to का:

10      जाने     जा      VERB    VM      Case=Acc|VerbForm=Inf   16      advcl   _       Vib=ना_के_लिए|Tam=nA|ChunkId=VGNN|ChunkType=head|Translit=jāne
11      के       के       ADP     PSP     AdpType=Post    10      mark    _       ChunkId=VGNN|ChunkType=child|Translit=ke
12      लिए     लिए     ADP     PSP     AdpType=Post    10      mark    _       ChunkId=VGNN|ChunkType=child|Translit=lie
--
9       देश      देश      NOUN    NN      Case=Acc|Gender=Masc|Number=Sing|Person=3       11      nmod    _       Vib=0_का|Tam=0|ChunkId=NP4|ChunkType=head|Translit=deśa
10      के       का      ADP     PSP     AdpType=Post|Case=Acc|Gender=Masc|Number=Plur   9       case    _       ChunkId=NP4|ChunkType=child|Translit=ke
11      लोगों    लोग     NOUN    NN      Case=Acc|Gender=Masc|Number=Plur|Person=3       14      obj     _       Vib=0_को|Tam=0|ChunkId=NP5|ChunkType=head|Translit=logoṁ

I suggest going through the dataset some yourself to see if there's a reasonable standard or if you think it should be changed. The fact that के isn't consistent is a little suspicious to me, if nothing else. Plus, as you point out, most other language datasets don't ignore the gender in the lemma.
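
One possible way to go through the dataset is to tally which lemmas each marker receives, grouped by dependency relation. Grouping by DEPREL is an assumption on my part: it is simply the column that differs between the two के rows shown above (`mark` versus `case`). The function name `lemma_tally` is hypothetical, and the sample rows are the ones quoted in this thread, not a full treebank.

```python
from collections import Counter, defaultdict


def lemma_tally(lines, forms):
    """Count lemmas per (form, deprel) for the given surface forms."""
    tally = defaultdict(Counter)
    for line in lines:
        if line.startswith("#"):
            continue                      # sentence-level comment
        cols = line.rstrip("\n").split("\t")
        if len(cols) != 10:
            continue                      # not a CoNLL-U token line
        form, lemma, deprel = cols[1], cols[2], cols[7]
        if form in forms:
            tally[(form, deprel)][lemma] += 1
    return tally


# Three rows quoted in this thread, rebuilt with real tab separators.
SAMPLE = [
    "\t".join(["11", "के", "के", "ADP", "PSP", "AdpType=Post",
               "10", "mark", "_", "ChunkId=VGNN|ChunkType=child|Translit=ke"]),
    "\t".join(["10", "के", "का", "ADP", "PSP",
               "AdpType=Post|Case=Acc|Gender=Masc|Number=Plur",
               "9", "case", "_", "ChunkId=NP4|ChunkType=child|Translit=ke"]),
    "\t".join(["3", "की", "का", "ADP", "PSP",
               "AdpType=Post|Case=Nom|Gender=Fem|Number=Sing",
               "2", "case", "_", "ChunkId=NP2|ChunkType=child|Translit=kī"]),
]

tally = lemma_tally(SAMPLE, {"का", "की", "के"})
for (form, deprel), lemmas in sorted(tally.items()):
    print(form, deprel, dict(lemmas))
```

If one (form, deprel) pair turns out to have more than one lemma across the whole treebank, that pair would be a natural place to start an issue against the dataset.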

Anyway, a lot of the dataset maintainers are pretty responsive to issues. You could create an issue or even a pull request against the Hindi dataset if you think the lemmas should be updated to reflect the gender of the pronoun. I would suggest starting from a more neutral attitude rather than going straight to calling the dataset sexist, though :) If you do effect some changes in those datasets, we can retrain the models at any point, not necessarily when UD 2.11 comes out. Alternatively, we can always train the models from a fork of the dataset if it seems they are not responding and you are certain your change is an improvement.

BTW, the reason I like this job is learning interesting tidbits about other languages - in English, the gender of the possessor determines the possessive pronoun, whereas in Hindi & Urdu the gender of the possessed determines the marker.

raydoc commented 2 years ago

Hi, I'll go through the datasets and get back to you. A little clarification: In the case of

11 के के ADP PSP AdpType=Post 10 mark _ ChunkId=VGNN|ChunkType=child|Translit=ke

12 लिए लिए ADP PSP AdpType=Post 10 mark _ ChunkId=VGNN|ChunkType=child|Translit=lie

In के लिए /ke lie/, /ke/ does not lemmatize to /ka/ because /ke lie/ constitutes one single unit, roughly translated as 'for which/whom', and demands a pronoun/noun before it. In this case: देश [country]. The whole construct means 'for the country's sake'.

Hopefully one more tidbit to add.

Best regards,

Doc

stale[bot] commented 2 years ago

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

AngledLuffa commented 2 years ago

Any luck sorting out the different lemmas in the datasets? Happy to rebuild the models for those languages if we make an improvement to the data.

stale[bot] commented 2 years ago

This issue has been automatically closed due to inactivity.