Closed juditacs closed 2 years ago
I have re-extracted the data. Do the case issues seem fixed?
bump @juditacs
Sorry, I missed the notification mail.
It's definitely better, but there are a few errors:
> egrep "(singular|plural)" hun
jeges plural N;NOM;PL
jeges singular N;NOM;SG
való plural N;NOM;PL
való singular N;NOM;SG
ős plural N;NOM;PL
ős singular N;NOM;SG
egész plural N;NOM;PL
egész singular N;NOM;SG
Nouns in dative always end with nak or nek:
> grep DAT hun | cut -f2 | grep -v "n[ae]k$" | wc -l
66
examples:
> grep DAT hun | cut -f2 | grep -v "n[ae]k$" | head
gyógyíthatóknak|
hadival
hadiakkal
hallhatóknak|
hamissal
hamisakkal
hasonlíthatóknak|
használhatóknak|
hordozhatóknak|
ihatóknak|
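For anyone who prefers a script to the grep pipeline, here is a minimal Python sketch of the same dative check. It assumes each line of hun is tab-separated as lemma, form, tags; the function name and the lemmas in the sample rows are only illustrative.

```python
import re

# Dative forms are expected to end in -nak or -nek.
DATIVE_ENDING = re.compile(r"n[ae]k$")

def suspicious_datives(lines):
    """Return forms tagged DAT that do not end in -nak/-nek.

    Assumes tab-separated lines: lemma, form, tags.
    """
    bad = []
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        form, tags = fields[1], fields[2]
        if "DAT" in tags.split(";") and not DATIVE_ENDING.search(form):
            bad.append(form)
    return bad

# Illustrative rows; the lemmas are assumed, the forms come from the output above.
sample = [
    "hamis\thamisnak\tN;DAT;SG",
    "hamis\thamissal\tN;DAT;SG",
    "gyógyítható\tgyógyíthatóknak|\tN;DAT;PL",
]
print(suspicious_datives(sample))  # ['hamissal', 'gyógyíthatóknak|']
```

Unlike `grep DAT`, this matches DAT only as a whole tag, so it will not pick up lines where those letters happen to appear elsewhere.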
There are many parsing errors where words contain a |:
> cut -f2 hun| grep "|" | wc -l
10323
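To see where the stray | characters cluster, a small sketch like the following could group the affected forms by tag bundle. It assumes the same tab-separated lemma, form, tags layout; the function name and sample rows are hypothetical.

```python
from collections import Counter

def pipe_forms_by_tag(lines):
    """Count forms containing '|' per tag bundle (third column)."""
    counts = Counter()
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 3 and "|" in fields[1]:
            counts[fields[2]] += 1
    return counts

# Illustrative rows only; lemmas and tags are assumed.
sample = [
    "iható\tihatóknak|\tN;DAT;PL",
    "hallható\thallhatóknak|\tN;DAT;PL",
    "hamis\thamisnak\tN;DAT;SG",
]
for tags, n in pipe_forms_by_tag(sample).most_common():
    print(tags, n)
```

The sum of the counts should equal what the `wc -l` pipeline reports, provided every line has at least three columns.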
In general, Hungarian nominal inflection is very regular, so you can easily grep for errors. The only exceptions are the instrumental and translative cases, which involve assimilation at the morpheme boundary. The endings are all listed here: https://hungaryforyou.wordpress.com/2013/02/23/noun-cases/
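The same idea generalizes to a small table of expected endings per case. This is only a sketch: the table below covers three regular cases (accusative -t, dative -nak/-nek, terminative -ig), the tag names ACC and TERM are assumed to match what the hun file uses, and the instrumental and translative are left out on purpose because of the assimilation.

```python
import re

# Hypothetical tag -> ending table; extend it from the list of endings linked above.
# INS and TRANS are deliberately omitted (assimilation at the morpheme boundary).
EXPECTED_ENDINGS = {
    "ACC": re.compile(r"t$"),        # -t
    "DAT": re.compile(r"n[ae]k$"),   # -nak / -nek
    "TERM": re.compile(r"ig$"),      # -ig
}

def check_endings(rows, expected=EXPECTED_ENDINGS):
    """Yield (lemma, form, tags) rows whose form lacks the ending for its case tag."""
    for lemma, form, tags in rows:
        for tag in tags.split(";"):
            pattern = expected.get(tag)
            if pattern is not None and not pattern.search(form):
                yield lemma, form, tags
                break

# Illustrative rows; the second one is deliberately mis-tagged.
rows = [
    ("ház", "házat", "N;ACC;SG"),
    ("ház", "háznak", "N;ACC;SG"),
    ("ház", "házig", "N;TERM;SG"),
]
for row in check_endings(rows):
    print(row)
```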
We'd welcome a pull request that fixes any of these that you're able to. I'm a bit stretched ragged at the moment.
There are two ways you could fix Hungarian:
hun itself, as much as you can/want to. In fact, if you want to become the Guardian™ of the Hungarian repo, we'll keep you involved in discussions/plans for upcoming releases/resource papers.
Hi all, the new update fixed these issues. @juditacs you can close this issue if you don't find any similar mistakes :)
Thank you so much for working on this.
I found one issue by computing the Jaccard similarity of lemmas and inflected forms and looking at the lowest values. Some descriptive verbs are only ever used in their 3rd person form, and Wiktionary notes this as only "3rd-person forms". These are now parsed as V;IND;PRS;INDF;1;SG but they really should be skipped.
Examples: https://en.wiktionary.org/wiki/havazik https://en.wiktionary.org/wiki/f%C3%A1j
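A rough sketch of the Jaccard check, in case it is useful for future extractions. It assumes the similarity is taken over the sets of characters in the lemma and the form (character n-grams would work too), and that the rows are (lemma, form, tags) tuples; the sample rows are illustrative rather than copied from the data.

```python
def jaccard(a, b):
    """Jaccard similarity of the character sets of two strings."""
    sa, sb = set(a), set(b)
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def lowest_similarity(rows, n=10):
    """Return the n rows whose form shares the fewest characters with its lemma."""
    scored = [(jaccard(lemma, form), lemma, form, tags) for lemma, form, tags in rows]
    return sorted(scored)[:n]

# Illustrative rows: a real form and a placeholder parsed as a form.
rows = [
    ("havazik", "havazik", "V;IND;PRS;INDF;3;SG"),
    ("havazik", "3rd-person forms", "V;IND;PRS;INDF;1;SG"),
]
for score, lemma, form, tags in lowest_similarity(rows, n=5):
    print(f"{score:.2f}\t{lemma}\t{form}\t{tags}")
```

Placeholder text shares almost no characters with the lemma, so it sorts to the top of the list.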
I found another similar placeholder when I looked at the difference between the length of the lemma and the inflected word: "the verb has no subjunctive forms"
Examples: https://en.wiktionary.org/wiki/fejlik https://en.wiktionary.org/wiki/rejlik
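The length-difference check can be sketched the same way. Again the rows are assumed to be (lemma, form, tags) tuples, and the tag on the placeholder row is hypothetical.

```python
def largest_length_gaps(rows, n=10):
    """Return the n rows with the largest |len(form) - len(lemma)|."""
    scored = [(abs(len(form) - len(lemma)), lemma, form, tags)
              for lemma, form, tags in rows]
    return sorted(scored, reverse=True)[:n]

# Illustrative rows; the tag bundles are assumed, not taken from the data.
rows = [
    ("fejlik", "fejlik", "V;IND;PRS;INDF;3;SG"),
    ("fejlik", "the verb has no subjunctive forms", "V;SBJV;PRS;INDF;1;SG"),
]
for gap, lemma, form, tags in largest_length_gaps(rows, n=5):
    print(gap, lemma, form, tags, sep="\t")
```

Real Hungarian suffixes add only a handful of characters, so unusually large gaps are a reasonable signal for leftover table notes.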
@juditacs Can you open another issue, so we can close this one and discuss the details of this last problem there? Thanks for raising it :)
Done.
About 40000 nouns are tagged with an incorrect noun case. I suspect that the Wiktionary inflection tables are parsed incorrectly.
Some (all?) tables of nominal inflection have two columns, singular and plural, and the columns are named in a second header line. This line is parsed as inflected forms of the noun, and the resulting one-line offset is kept throughout the table.
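To illustrate the suspected offset, here is a minimal sketch that models a table as a list of [singular, plural] rows and drops the second header line before pairing rows with case tags. The actual extraction code is not shown in this thread, so the function name, the list-of-lists model, and the three-case tag list are all assumptions.

```python
CASE_TAGS = ["NOM", "ACC", "DAT"]  # hypothetical subset, in table order
NUMBER_HEADERS = {"singular": "SG", "plural": "PL"}

def parse_table(rows, case_tags=CASE_TAGS):
    """Pair each table row with a case tag, skipping a singular/plural header row."""
    if rows and rows[0] and all(cell in NUMBER_HEADERS for cell in rows[0]):
        # Use the sub-header to name the columns, then drop it from the data.
        numbers = [NUMBER_HEADERS[cell] for cell in rows[0]]
        rows = rows[1:]
    else:
        numbers = ["SG", "PL"]
    out = []
    for case, cells in zip(case_tags, rows):
        for number, form in zip(numbers, cells):
            out.append((form, f"N;{case};{number}"))
    return out

table = [
    ["singular", "plural"],
    ["bevándorlás", "bevándorlások"],
    ["bevándorlást", "bevándorlásokat"],
    ["bevándorlásnak", "bevándorlásoknak"],
]
for form, tags in parse_table(table):
    print(form, tags, sep="\t")
```

Without the header check, "singular" and "plural" would be emitted as the nominative forms and every real form would receive the tag of the following case.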
An example: https://en.wiktionary.org/wiki/bev%C3%A1ndorl%C3%A1s
This is the output (and I added the correct inflected form as well):