unimorph / hun


Missing paradigm placeholder text parsed as inflected word form #8

Open juditacs opened 2 years ago

juditacs commented 2 years ago

I computed the character Jaccard similarity between lemmas and inflected forms and looked at the lowest values. Some descriptive verbs are only ever used in their 3rd-person form, and Wiktionary notes this as only "3rd-person forms". These placeholders are currently parsed as V;IND;PRS;INDF;1;SG, but they really should be skipped.

Examples: https://en.wiktionary.org/wiki/havazik https://en.wiktionary.org/wiki/f%C3%A1j
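For anyone who wants to reproduce the check, here is a minimal sketch of a character-level Jaccard similarity used to rank entries. The function name `char_jaccard` and the sample rows are made up for illustration and are not taken from the repository; the second row only mimics the mis-parsed placeholder described above.

```python
def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity between the character sets of two strings."""
    set_a, set_b = set(a), set(b)
    if not set_a and not set_b:
        return 1.0  # two empty strings are treated as identical
    return len(set_a & set_b) / len(set_a | set_b)


# Illustrative (lemma, form, tags) triples, not real repository rows.
rows = [
    ("havazik", "havazik", "V;IND;PRS;INDF;3;SG"),
    ("havazik", "3rd-person forms", "V;IND;PRS;INDF;1;SG"),
]

# Sort ascending so the most suspicious pairs come first.
ranked = sorted(rows, key=lambda row: char_jaccard(row[0], row[1]))
for lemma, form, tags in ranked:
    print(f"{char_jaccard(lemma, form):.2f}\t{lemma}\t{form}\t{tags}")
```

Pairs that share almost no characters float to the top, which is where placeholder text tends to show up.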

I found another, similar placeholder when I looked at the difference in length between the lemma and the inflected form: "the verb has no subjunctive forms"

Examples: https://en.wiktionary.org/wiki/fejlik https://en.wiktionary.org/wiki/rejlik
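The length comparison can be sketched the same way. The names `length_gap` and `flag_by_length` and the threshold of 10 characters are assumptions chosen for illustration; the sample pairs are not real repository rows.

```python
def length_gap(lemma: str, form: str) -> int:
    """Absolute difference in character length between lemma and form."""
    return abs(len(form) - len(lemma))


def flag_by_length(pairs, max_gap=10):
    """Return the (lemma, form) pairs whose length gap exceeds max_gap."""
    return [(lemma, form) for lemma, form in pairs
            if length_gap(lemma, form) > max_gap]


# Illustrative pairs; the second one mimics the placeholder quoted above.
pairs = [
    ("fejlik", "fejlik"),
    ("fejlik", "the verb has no subjunctive forms"),
]

suspicious = flag_by_length(pairs)
print(suspicious)
```

The threshold would need tuning on real data, since Hungarian suffix chains can legitimately add several characters to a lemma.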

Mentioned in https://github.com/unimorph/hun/issues/1

kbatsuren commented 2 years ago

That is such a wonderful way to find mistakes. So should I skip all the entries marked 'only 3rd-person forms', as shown in the image below?

image

By the way, I fully agree with you that the subjunctive forms should be removed. (Wiktionarians may have different ideas about them; unfortunately, we may never know.)

juditacs commented 2 years ago

Yes, I think they should be skipped, since 1. they are not used, and 2. the actual inflected form (if it exists at all; some don't) is not specified in Wiktionary.
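A minimal sketch of what skipping could look like, assuming the extracted data is available as (lemma, form, tags) triples. The names `PLACEHOLDERS`, `is_placeholder`, and `drop_placeholders` are hypothetical, and the list contains only the two strings mentioned in this thread; other placeholders may exist.

```python
# Placeholder strings quoted in this thread; probably not exhaustive.
PLACEHOLDERS = (
    "3rd-person forms",
    "the verb has no subjunctive forms",
)


def is_placeholder(form: str) -> bool:
    """True if the cell text contains a known placeholder, ignoring case."""
    text = form.strip().lower()
    return any(p in text for p in PLACEHOLDERS)


def drop_placeholders(rows):
    """Keep only the rows whose form is not placeholder text."""
    return [row for row in rows if not is_placeholder(row[1])]


# Illustrative rows, not real repository data.
rows = [
    ("havazik", "havazik", "V;IND;PRS;INDF;3;SG"),
    ("havazik", "only 3rd-person forms", "V;IND;PRS;INDF;1;SG"),
]

print(drop_placeholders(rows))
```

Substring matching is used so that small variations around the phrase, such as a leading "only", are still caught.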

Is there a UniMorph guideline for these cases? I doubt this pertains only to Hungarian.