unimorph / hun


Noun case errors #1

Closed juditacs closed 2 years ago

juditacs commented 6 years ago

About 40000 nouns are tagged with an incorrect noun case. I suspect that the Wiktionary inflection tables are parsed incorrectly.

Some (all?) nominal inflection tables have two columns, singular and plural, and the columns are named in a second header line. The parser treats this header line as inflected forms of the noun, and the resulting one-row offset carries through the rest of the table.
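To make the offset concrete, here is a minimal Python sketch. It is not the actual extractor: `CASES`, `ROWS`, and `parse` are hypothetical names, the table is cut down to four cases, and the plural forms other than the nominative are filled in from regular Hungarian inflection rather than taken from the data.

```python
# Case labels in table order (reduced to four for illustration).
CASES = ["NOM", "ACC", "DAT", "INST"]

# Table body as it might come out of the page: the first row is the second
# header line that names the columns, not an inflected form.
ROWS = [
    ["singular", "plural"],
    ["bevándorlás", "bevándorlások"],
    ["bevándorlást", "bevándorlásokat"],
    ["bevándorlásnak", "bevándorlásoknak"],
    ["bevándorlással", "bevándorlásokkal"],
]


def parse(rows, skip_header):
    """Pair each row with a case label, optionally dropping the header row."""
    if skip_header:
        rows = rows[1:]
    out = []
    for case, (sg, pl) in zip(CASES, rows):
        out.append((sg, f"N;{case};SG"))
        out.append((pl, f"N;{case};PL"))
    return out


buggy = parse(ROWS, skip_header=False)  # header row consumed as NOM forms
fixed = parse(ROWS, skip_header=True)   # header row dropped first

for form, tags in buggy:
    print(form, tags, sep="\t")
```

With `skip_header=False` the words "singular" and "plural" receive the nominative labels and every real form is labelled with the case of the row below it, which is the pattern described above.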

An example: https://en.wiktionary.org/wiki/bev%C3%A1ndorl%C3%A1s

This is the output (and I added the correct inflected form as well):

| lemma | inflected (current) | labels | inflected (correct) | correct labels for the current form |
|---|---|---|---|---|
| bevándorlás | plural | N;NOM;PL | bevándorlások | |
| bevándorlás | singular | N;NOM;SG | bevándorlás | |
| bevándorlás | bevándorlás | N;ACC;SG | bevándorlást | N;NOM;SG |
| bevándorlás | bevándorlásnak | N;INST;SG | bevándorlással | N;DAT;SG |

ckirov commented 6 years ago

I have re-extracted the data. Do the case issues seem fixed?

aryamccarthy commented 5 years ago

bump @juditacs

juditacs commented 5 years ago

Sorry, I missed the notification mail.

It's definitely better, but there are a few errors:

> egrep "(singular|plural)" hun
jeges   plural  N;NOM;PL
jeges   singular        N;NOM;SG
való    plural  N;NOM;PL
való    singular        N;NOM;SG
ős      plural  N;NOM;PL
ős      singular        N;NOM;SG
egész   plural  N;NOM;PL
egész   singular        N;NOM;SG

Nouns in the dative case always end in -nak or -nek, yet some forms tagged DAT do not:

> grep DAT hun | cut -f2 | grep -v "n[ae]k$" | wc -l
66

Examples:

> grep DAT hun | cut -f2 | grep -v "n[ae]k$" | head
gyógyíthatóknak|
hadival
hadiakkal
hallhatóknak|
hamissal
hamisakkal
hasonlíthatóknak|
használhatóknak|
hordozhatóknak|
ihatóknak|

There are also many parsing errors where the word contains a |:

> cut -f2 hun | grep "|" | wc -l
10323

In general, Hungarian nominal inflection is very regular, so you can easily grep for errors. The only exceptions are the instrumental and translative cases, which trigger assimilation at the morpheme boundary. The endings are all listed here: https://hungaryforyou.wordpress.com/2013/02/23/noun-cases/
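The same checks can be bundled into one pass over the file. The following is a rough Python sketch, assuming the three-column tab-separated layout (lemma, form, tags) seen in the grep output. `find_suspects` and `SAMPLE` are hypothetical names, and the lemmas and number tags in the sample rows are guesses added for illustration; only the forms come from the output above.

```python
import re

DATIVE_ENDING = re.compile(r"n[ae]k$")   # regular dative suffix -nak / -nek
HEADER_WORDS = {"singular", "plural"}    # column names leaked in as forms


def find_suspects(lines):
    """Yield (reason, line) for rows that fail a simple sanity check."""
    for line in lines:
        parts = line.rstrip("\n").split("\t")
        if len(parts) != 3:
            yield ("malformed row", line)
            continue
        lemma, form, tags = parts
        if form in HEADER_WORDS:
            yield ("header parsed as form", line)
        elif "|" in form:
            yield ("pipe in form", line)
        elif "DAT" in tags.split(";") and not DATIVE_ENDING.search(form):
            yield ("dative without -nak/-nek", line)


# Sample rows; lemmas and SG/PL tags are assumed, not taken from the data.
SAMPLE = [
    "jeges\tplural\tN;NOM;PL",
    "hadi\thadival\tN;DAT;SG",
    "bevándorlás\tbevándorlásnak\tN;DAT;SG",
    "gyógyítható\tgyógyíthatóknak|\tN;DAT;PL",
]

for reason, line in find_suspects(SAMPLE):
    print(reason, line, sep="\t")
```

To run it on the real data, pass an open file handle for `hun` (opened with `encoding="utf-8"`) instead of `SAMPLE`. Other cases could be added the same way, but the instrumental and translative would need a looser pattern because of the assimilation.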

aryamccarthy commented 5 years ago

We'd welcome a pull request that fixes whichever of these you're able to. I'm a bit stretched ragged at the moment.

There are two ways you could fix Hungarian:

  1. Modify hun itself, as much as you can/want to.
  2. Modify the extractor, so that the pipeline is Reproducible™.

In fact, if you want to become the Guardian™ of the Hungarian repo, we'll keep you involved in discussions/plans for upcoming releases/resource papers.

kbatsuren commented 2 years ago

Hi all, the new update fixed these issues. @juditacs you can close this issue if you don't find any similar mistakes :)

juditacs commented 2 years ago

Thank you so much for working on this.

I found one issue by computing the Jaccard similarity of lemmas and inflected forms and looking at the lowest values. Some descriptive verbs are only ever used in their 3rd person form and Wiktionary notes this as only "3rd-person forms". These are now parsed as V;IND;PRS;INDF;1;SG but they really should be skipped.

Examples: https://en.wiktionary.org/wiki/havazik https://en.wiktionary.org/wiki/f%C3%A1j
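For anyone who wants to repeat the similarity check, here is a small Python sketch. It computes Jaccard similarity over character sets, which is an assumption: the comparison could equally be done over character n-grams. `jaccard` and `pairs` are hypothetical names, and the exact placeholder string in the data may differ from the one used here.

```python
def jaccard(a, b):
    """Jaccard similarity of the character sets of two strings."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0  # two empty strings are treated as identical
    return len(sa & sb) / len(sa | sb)


# (lemma, form) pairs; the second form is a placeholder, not an inflection.
pairs = [
    ("bevándorlás", "bevándorlást"),
    ("havazik", "3rd-person forms"),
]

# Lowest similarity first: real inflections share most characters with
# their lemma, so placeholder text sorts to the top.
ranked = sorted(pairs, key=lambda p: jaccard(*p))

for lemma, form in ranked:
    print(f"{jaccard(lemma, form):.2f}", lemma, form, sep="\t")
```

Sorting the whole file this way and reading the first few dozen rows should be enough to surface placeholders of this kind.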

I found another similar placeholder when I looked at the difference between the length of the lemma and the inflected word: "the verb has no subjunctive forms"

Examples: https://en.wiktionary.org/wiki/fejlik https://en.wiktionary.org/wiki/rejlik
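The length comparison can be sketched in the same spirit. In this Python snippet `length_gap`, `pairs`, and `THRESHOLD` are hypothetical names, and the cutoff of 15 characters is an arbitrary value chosen for the illustration, not something derived from the data.

```python
def length_gap(lemma, form):
    """Absolute difference in length between a lemma and its inflected form."""
    return abs(len(form) - len(lemma))


# (lemma, form) pairs; the second form is placeholder text from the table.
pairs = [
    ("bevándorlás", "bevándorlásnak"),
    ("fejlik", "the verb has no subjunctive forms"),
]

THRESHOLD = 15  # hypothetical cutoff; tune it against the real data

# Keep only pairs whose form is implausibly longer or shorter than the lemma.
outliers = [(lemma, form) for lemma, form in pairs
            if length_gap(lemma, form) > THRESHOLD]

for lemma, form in outliers:
    print(length_gap(lemma, form), lemma, form, sep="\t")
```

A fixed threshold is crude; sorting by the gap and inspecting the largest values by hand avoids having to pick one.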

kbatsuren commented 2 years ago

@juditacs Can you open another issue? Then we can close this one and discuss the details of this last problem in the new issue. Thanks for raising it :)

juditacs commented 2 years ago

Done.