kuhumcst / cstlemma

Lemmatiser for Danish, Dutch, English, German, Polish, Romanian, Russian and tens of other languages, that uses affix rules (affix: prefix, infix, suffix, circumfix). Rules are obtained by supervised learning from a full form - lemma list.
GNU General Public License v2.0

How to properly run the command to get all possible morphological tags for every word in a sentence? #3

Closed: onurgu closed this issue 6 years ago

onurgu commented 6 years ago

I tried to run the tool on the sentence "Proces privatizacije na Kosovu pod lupom" using the following command-line parameters, with the aim of obtaining all possible (ambiguous) morphological tags:

$ ./cstlemma -f ../../7-peglanje/setimes.hr.v1.2.manual.cstpats -d ../../7-peglanje/setimes.hr.v1.2.manual.cstdict -i deneme.txt -b'$w/$t\n'
Old style rules. First four bytes, as int: 66636741. As char*:
Agcf
na/Sl
pod/Si
privatizacija/Ncfsg

However, I am not satisfied with the results. I don't speak Croatian and don't know the details of the language, but I was expecting output like the following:

Proces privatizacije na Kosovu pod lupom
Proces/Tag1
privatizacija/Ncfsg
na/Sl
Kosovu/Tag2
pod/Si
lupom/Tag3

What am I missing? (I made up the missing tags and roots.)

For example, in Turkish the word elması might refer to different objects because of the suffix, so it needs to be disambiguated using the sentence context. That gives two possible tags for elması: (i) elma+P3sg and (ii) elmas+P3sg.

I am not sure whether Croatian has this type of ambiguity, but I was guessing it might.

The desired output in general:

word1 candidatetag1 candidatetag2
word2 candidatetag1
word3 candidatetag4 candidatetag5
...
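
To make the requested format concrete, here is a minimal Python sketch of a lookup that lists every candidate analysis for each word. It is not CSTlemma or LemmaX code and says nothing about how either tool works internally. The lexicon is a hypothetical toy list, and the names TOY_LEXICON, build_index and candidates_per_word are invented for illustration. The entries reuse only analyses quoted in this thread; that privatizacije is the full form behind privatizacija/Ncfsg is an assumption based on the output above.

```python
from collections import defaultdict

# Hypothetical toy lexicon of (full form, lemma, tag) triples.
# Assumption: these entries are for illustration only and are not
# taken from the .cstdict file used in the command above.
TOY_LEXICON = [
    ("privatizacije", "privatizacija", "Ncfsg"),
    ("na", "na", "Sl"),
    ("pod", "pod", "Si"),
    ("elması", "elma", "P3sg"),   # first reading from the Turkish example
    ("elması", "elmas", "P3sg"),  # second reading from the Turkish example
]


def build_index(lexicon):
    """Map each lower-cased full form to all of its lemma+tag analyses."""
    index = defaultdict(list)
    for form, lemma, tag in lexicon:
        index[form.lower()].append(f"{lemma}+{tag}")
    return index


def candidates_per_word(sentence, index):
    """Return one line per word: the word followed by every candidate analysis.

    Words that are not in the index are returned on their own,
    with no candidates listed.
    """
    lines = []
    for word in sentence.split():
        analyses = index.get(word.lower(), [])
        lines.append(" ".join([word] + analyses))
    return lines


if __name__ == "__main__":
    index = build_index(TOY_LEXICON)
    for line in candidates_per_word("privatizacije na elması Kosovu", index):
        print(line)
```

With this toy lexicon, elması is listed with both candidates on one line, while Kosovu, which has no entry, is printed without any candidates. Choosing between the candidates would still need a separate disambiguation step that uses the sentence context.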
BartJongejan commented 6 years ago

The current version of CSTlemma cannot do what you want. CSTlemma can only output PoS tags that are already present in the input. So either the input is tagged and so is the output, or the input is not tagged and neither is the output.

I have an experimental version of a lemmatiser that does what you propose, but it has not yet been turned into C++ code and is therefore not very fast. It also requires the new-style rules. Since you ask, I will move this functionality higher up the CSTlemma wish list.

onurgu commented 6 years ago

Thank you. Would it be possible for you to share the experimental version you mentioned, even if it is not very fast?

BartJongejan commented 6 years ago

Yes, it is at https://github.com/kuhumcst/affixtrain/blob/master/example/LemmaVal.bra. Yesterday I discovered that I had already shared it! I am working on an improved version (documentation, better functionality), so check for updates in the coming days.

BartJongejan commented 6 years ago

I recommend that you use the program in this new repo:

https://github.com/kuhumcst/LemmaX

onurgu commented 6 years ago

Thank you very much, I will look into that.