vizzerdrix55 closed this issue 5 years ago
You are right: SoMeWeTa is only a part-of-speech tagger and not a lemmatizer. At the moment, lemmatization is simply not implemented.
Our research group is definitely interested in lemmatization and works towards creating a lemmatizer that performs well on German web and social media texts. So far, we've already created a new gold standard by manually lemmatizing the EmpiriST dataset. For the time being, however, you need to use some other tool for lemmatization.
Regarding the mapping of the POS tags: Since STTS_IBK is an extension or refinement of STTS, a mapping from STTS_IBK to STTS should not be too difficult and can conveniently be represented as a dictionary. You can find such a mapping, for example, in Rehbein et al. (2018). With the exception of the new contraction tags (.+PPER and ADVART), the additional tags of STTS_IBK can be trivially lemmatized anyway: the lemma should be the surface form in most cases.
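To make the dictionary idea concrete, here is a minimal sketch. The tag names on the left are STTS_IBK tags, but the STTS targets on the right are only illustrative assumptions and the table is incomplete; it is not the mapping from Rehbein et al. (2018). The helper names `to_stts` and `trivial_lemma` are hypothetical and not part of SoMeWeTa.

```python
import re

# Illustrative and incomplete. The STTS targets are assumptions made for
# this sketch, NOT the mapping published by Rehbein et al. (2018).
STTS_IBK_TO_STTS = {
    "EMOASC": "XY",   # ASCII emoticon
    "EMOIMG": "XY",   # graphical emoji
    "HST": "XY",      # hashtag
    "ADR": "XY",      # addressing term (@user)
    "URL": "XY",
    "EML": "XY",      # e-mail address
    "ONO": "ITJ",     # onomatopoeia
    "PTKIFG": "ADV",  # intensifier / focus / gradation particle
    "PTKMA": "ADV",   # modal particle
}

# Contraction tags as described above: anything ending in PPER that has a
# prefix (e.g. VAPPER), plus ADVART. Plain PPER does not match.
CONTRACTION_TAG = re.compile(r".+PPER|ADVART")


def to_stts(tag):
    """Map an STTS_IBK tag to STTS; tags without an entry pass through unchanged."""
    return STTS_IBK_TO_STTS.get(tag, tag)


def trivial_lemma(token, tag):
    """Return the surface form for the additional STTS_IBK tags listed above.

    Returns None for contraction tags and for every other tag, which
    would still need a real lemmatizer.
    """
    if CONTRACTION_TAG.fullmatch(tag):
        return None
    if tag in STTS_IBK_TO_STTS:
        return token
    return None
```

In this sketch, contraction tags pass through `to_stts` unchanged because they have no entry in the table; a real mapping would have to decide how to split or relabel them.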
As far as I can see, there is no lemmatizer built into SoMeWeTa. It would be more convenient to have a third column in the output that contains the lemma. My output looks like this:
And what I have in mind is something like this:
This feature is supported, e.g., by TextBlob (see Words Inflection and Lemmatization) and Stanford CoreNLP. Taking SoMeWeTa's POS tags into account in common lemmatisers (see this overview of 7 lemmatisers) is, in my view, only possible with a mapping of the POS tags, which is time-consuming, non-pythonic and probably loses information because some tags have no counterpart in the destination tagset.
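As a rough sketch of the requested third column: the code below assumes the tagger output is one tab-separated token/tag pair per line, with empty lines between sentences. The function `add_lemma_column` and the placeholder `surface_lemmatizer` are hypothetical names for illustration; any external lemmatizer that accepts a token and a tag could be plugged in instead.

```python
def add_lemma_column(lines, lemmatize):
    """Append a lemma as a third tab-separated column.

    `lines` is assumed to hold "token<TAB>tag" entries, with empty
    strings marking sentence boundaries. `lemmatize` is any callable
    taking (token, tag) and returning a lemma string.
    """
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            # Keep sentence boundaries as they are.
            out.append("")
            continue
        token, tag = line.split("\t")
        out.append("\t".join((token, tag, lemmatize(token, tag))))
    return out


def surface_lemmatizer(token, tag):
    """Placeholder only: returns the surface form unchanged."""
    return token
```

For example, `add_lemma_column(["Das\tPDS", ""], surface_lemmatizer)` keeps the empty separator line and turns the first line into three columns.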