tarsqi / ttk

Tarsqi Toolkit
Apache License 2.0

Add lemmatizer #6

Open marcverhagen opened 8 years ago

marcverhagen commented 8 years ago

The TreeTagger fails to find a lemma for many tokens and in those cases puts <unknown> in the lemma field. Replace those values with the result of a WordNet or UMLS lookup (the latter for the medical genre). The most natural place to do this is right after text = self.tag_text(tokens) in PreProcessorWrapper.process() in components.preprocessor.wrapper.
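A minimal sketch of the replacement step, to make the idea concrete. It assumes the tagger output can be viewed as (token, pos, lemma) tuples, which may not match what tag_text() actually returns. The names fill_unknown_lemmas, toy_lookup and TOY_LEXICON are hypothetical, and the toy dictionary only stands in for a real WordNet or UMLS lookup.

```python
UNKNOWN = "<unknown>"  # placeholder the TreeTagger writes into the lemma field


def fill_unknown_lemmas(tagged, lookup):
    """Return a new list in which <unknown> lemmas are replaced.

    tagged: iterable of (token, pos, lemma) tuples (assumed shape, not
            necessarily what tag_text() returns).
    lookup: callable (token, pos) -> lemma string, or None if not found.
    """
    result = []
    for token, pos, lemma in tagged:
        if lemma == UNKNOWN:
            # Assumption: fall back to the surface token when the lookup
            # has no entry, so <unknown> never reaches later components.
            lemma = lookup(token, pos) or token
        result.append((token, pos, lemma))
    return result


# Hypothetical stand-in for a WordNet or UMLS lookup.
TOY_LEXICON = {
    ("stents", "NNS"): "stent",
    ("intubated", "VBN"): "intubate",
}


def toy_lookup(token, pos):
    """Look up a lemma by lowercased token and part-of-speech tag."""
    return TOY_LEXICON.get((token.lower(), pos))


tagged = [
    ("The", "DT", "the"),
    ("stents", "NNS", "<unknown>"),
    ("were", "VBD", "be"),
    ("Xyzzy", "NP", "<unknown>"),
]
filled = fill_unknown_lemmas(tagged, toy_lookup)
```

Passing the lookup in as a callable keeps the genre choice (WordNet versus UMLS) out of the wrapper itself. Lemmas the tagger did find are left untouched.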

It may be useful to have a look at the UMLS Lexical Variant Generation (lvg) tools.