Closed by onurgu 6 years ago
The current version of CSTlemma cannot do what you want. CSTlemma can only output PoS tags that are already present in the input. So either the input is tagged and so is the output, or the input is not tagged and neither is the output.
I have an experimental version of a lemmatiser that does what you propose, but it has not yet been ported to C++ and is therefore not very fast. It also requires the new-style rules. Since you ask, I will move the functionality higher up the CSTlemma wish list.
Thank you. Is it possible for you to share the experimental version you mentioned, even if it is not super fast?
Yes, it is https://github.com/kuhumcst/affixtrain/blob/master/example/LemmaVal.bra. Yesterday I discovered that I already had shared it! I am working on an improved version (documentation, better functionality), so check for updates one of these days.
I recommend that you use the program in this new repo:
Thank you very much, I will look into that.
I tried to run the tool on the sentence
Proces privatizacije na Kosovu pod lupom
using the following command-line parameters, with the aim of obtaining all possible (ambiguous) morphological tags:
However, I am not satisfied with the results. I don't speak Croatian or know the details of the language, but I was expecting to get output like the following:
What am I missing? (I made up the missing tags and roots)
For example, in Turkish the word elması might refer to different objects because of the suffix, and it needs to be disambiguated using the sentence context. So we have two possible tags for elması: i) elma+P3sg ii) elmas+P3sg
I am not sure whether Croatian has this type of ambiguity, but I was guessing it might.
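To make the Turkish example concrete, here is a minimal Python sketch of what "return every possible analysis" could look like. It is not how CSTlemma or the experimental lemmatiser works: the two-word lexicon, the single P3sg suffix rule, and the function name `analyses` are all made up for illustration.

```python
# Toy lexicon and suffix rule. Both are illustrative assumptions,
# not data or logic taken from CSTlemma.
LEXICON = {"elma", "elmas"}
VOWELS = set("aeıioöuü")
HARMONY_VOWELS = "ıiuü"  # possible surface vowels of the P3sg suffix


def analyses(word):
    """Return every stem+P3sg reading of `word` licensed by the toy lexicon."""
    results = []
    if not word or word[-1] not in HARMONY_VOWELS:
        return results
    # Consonant-final stem: the suffix is a single vowel (elmas + ı).
    stem = word[:-1]
    if stem in LEXICON and stem[-1] not in VOWELS:
        results.append(f"{stem}+P3sg")
    # Vowel-final stem: the suffix is "s" + vowel (elma + sı).
    if len(word) > 2 and word[-2] == "s":
        stem = word[:-2]
        if stem in LEXICON and stem[-1] in VOWELS:
            results.append(f"{stem}+P3sg")
    return sorted(results)


print(analyses("elması"))  # ['elma+P3sg', 'elmas+P3sg']
```

The point is only that the analyser keeps both readings instead of picking one; choosing between them would be a separate, context-dependent step.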
The desired output in general: