olesar / UD_Lithuanian

Resources and documentation for a Lithuanian Universal Dependencies treebank
https://github.com/UniversalDependencies/UD_Lithuanian-HSE/blob/master/README.md
GNU General Public License v3.0

Lemmatization of verbs with prefixes #5

Open olesar opened 5 years ago

olesar commented 5 years ago

5 types of lemmas:

olesar commented 5 years ago

Remove the ne- and nebe- prefixes from verb lemmas in Lithuanian-HSE, keeping them only for verbs that cannot be used without the prefix. This will make the data more parallel to Latvian v.2.3, Czech, etc. Create a list of known exceptions and morphophonemic issues. Example (with -e- omitted in tebe-):

teblikę tebelikti VERB Definite=Ind|Gender=Neut|Reflex=No|Tense=Past|VerbForm=Part|Voice=Act

olesar commented 5 years ago

What about te-?

Nofenigma commented 5 years ago

An important remark: no actual dictionary of Lithuanian (including the most representative one, LKŽ) lists these derived forms as separate lemmas. The interpretation by the Kaunas team seems to be simply a cheap "lemmatize-by-common-rules" solution. These prefixes (ne-, be-, te-) are very productive, more productive and regular than ordinary verbal prefixes. NB: the Latvian team previously lemmatized negated verbs as separate lemmas as well, but they fixed this in the latest version (https://github.com/UniversalDependencies/UD_Latvian-LVTB/issues/2). So making lemmatization more uniform now seems necessary.

There seems to be no reasonably representative (big) machine-readable dictionary against which we could check as many lemmas as possible, i.e. remove the prefixes and check whether the resulting lemma is in the dictionary.

Possible solutions

Variant 1.

Write a script that checks whether the strings with prefixes or prefix combinations removed are attested in the corpus-based frequency dictionary (http://donelaitis.vdu.lt/publikacijos/dazninis-TXT.zip). By the way, this dictionary also lists all these additional prefixed lemmas, which, of course, must affect its statistics for verbs.

Variant 2.

Use additional dictionary resources for those lemma candidates that are not found in the frequency dictionary.

LKŽ: http://lkz.lt/; 236 000 entries, really comprehensive (dialectal words, obsolete words, etc.). NB: it has a tricky way of representing derivatives in its entries, so in most cases it would require careful manual search and would be hard to automate.

DLKŽ (the Dictionary of Modern Lithuanian): http://lkiis.lki.lt/dabartinis; about 50 000 entries.

NB: both variants presume the existence of a "stop list" of lexemes (lemmas) whose stems genuinely start with ne-, be- or te-. One should start with this step (again, at least check the frequency dictionary).
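A minimal sketch of the check described in Variant 1, combined with the stop list. It is an illustration under stated assumptions, not a tested tool: the function name `strip_prefix`, the prefix inventory in `PREFIXES`, and the toy word sets are all hypothetical, and the format of the frequency dictionary is not assumed here (the known lemmas are simply passed in as a set). Longer prefix combinations are tried first because the frequency dictionary itself contains prefixed lemmas, so stripping only ne- from a nebe- verb could otherwise stop at an intermediate be- form.

```python
# Hypothetical prefix inventory; longest combinations first so that
# "nebe" and "tebe" are tried before "ne", "te" and "be".
PREFIXES = ("nebe", "tebe", "ne", "be", "te")


def strip_prefix(lemma, known_lemmas, stop_list):
    """Return the lemma without a leading ne-/be-/te- prefix when that is safe.

    known_lemmas: set of lemmas attested in some reference dictionary.
    stop_list: set of lemmas that must keep their initial ne-/be-/te-.
    """
    # Lexemes on the stop list are never modified.
    if lemma in stop_list:
        return lemma
    for prefix in PREFIXES:
        if lemma.startswith(prefix):
            candidate = lemma[len(prefix):]
            # Accept the candidate only if it is attested as a lemma.
            if candidate in known_lemmas:
                return candidate
    # No attested base found: keep the lemma for manual review.
    return lemma


# Toy data for illustration only; real sets would come from the
# frequency dictionary and a manually compiled stop list.
known = {"likti", "eiti", "kęsti"}
stop = {"nekęsti"}

for lemma in ("tebelikti", "nebelikti", "neeiti", "nekęsti", "nešti"):
    print(lemma, "->", strip_prefix(lemma, known, stop))
```

Lemmas that come back unchanged and are not on the stop list are the candidates for the manual dictionary lookup described in Variant 2.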

Nofenigma commented 5 years ago

I have just found this; it seems to be what I was looking for (a corpus-based wordlist that should be comprehensive enough): https://clarin.vdu.lt/xmlui/handle/20.500.11821/8 The problem is that it is not lemmatized: it gives frequencies for wordforms only...