wartaal / HanTa

The Hanover Tagger - A simple approach to lemmatization and POS-tagging of German morphology based on heuristics and hidden markov models
GNU Lesser General Public License v3.0

Problems with recognizing variants of the adjective "teuer" #3

Closed BastianBaumeister closed 2 years ago

BastianBaumeister commented 2 years ago

Sorry if this isn't the proper channel for reporting non-programming related issues.

First of all, I want to say that I'm quite impressed with HanTa. Over the last few days I tried many different lemmatization packages in R and Python, but most of them were rather lackluster for German text. HanTa, on the other hand, recognizes almost anything I throw at it. Great!

There just seems to be one word that the algorithm doesn't "get along" with: "teuer", specifically its many inflections. Some examples:

word: teurere, lemma: teu
word: teure, lemma: teu
word: überteuerten, lemma: überteuet

Other inflections work fine:
word: teuerste, lemma: teuer
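A minimal sketch for reproducing the cases above as a small regression check. The `HanoverTagger("morphmodel_ger.pgz")` constructor and `analyze(word)` call follow the HanTa README; the helper names (`find_mismatches`, `hanta_lemmatizer`) are made up for illustration, and the expected lemma "überteuert" for "überteuerten" is an assumption, since it is not stated in this report. If HanTa is not installed, the script falls back to replaying the lemmas reported here.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Lemmas as reported in this issue (the behaviour at the time of writing).
REPORTED: Dict[str, str] = {
    "teurere": "teu",
    "teure": "teu",
    "überteuerten": "überteuet",
    "teuerste": "teuer",
}

# Expected lemmas. "teuer" follows from the issue itself;
# "überteuert" is an ASSUMPTION, the issue does not state it.
EXPECTED: Dict[str, str] = {
    "teurere": "teuer",
    "teure": "teuer",
    "teuerste": "teuer",
    "überteuerten": "überteuert",
}


def find_mismatches(
    lemmatize: Callable[[str], Optional[str]],
    expected: Dict[str, str],
) -> List[Tuple[str, Optional[str], str]]:
    """Return (word, got, wanted) for every word whose lemma differs."""
    mismatches = []
    for word, wanted in expected.items():
        got = lemmatize(word)
        if got != wanted:
            mismatches.append((word, got, wanted))
    return mismatches


def hanta_lemmatizer() -> Callable[[str], str]:
    """Build a word -> lemma function backed by HanTa (needs `pip install HanTa`)."""
    from HanTa import HanoverTagger as ht

    tagger = ht.HanoverTagger("morphmodel_ger.pgz")
    # analyze() returns a (lemma, POS tag) pair at the default tag level.
    return lambda word: tagger.analyze(word)[0]


if __name__ == "__main__":
    try:
        lemmatize = hanta_lemmatizer()
    except ImportError:
        # HanTa not available: replay the lemmas reported in this issue.
        lemmatize = REPORTED.get
    for word, got, wanted in find_mismatches(lemmatize, EXPECTED):
        print(f"word: {word}, lemma: {got} (expected: {wanted})")
```

Passing the lemmatizer in as a plain function keeps the check independent of the installed HanTa version, so the same word list can be rerun after an upgrade.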

Apart from this one issue, I'm really impressed with your work, and I hope the package will be maintained in the future.

wartaal commented 2 years ago

Dear Bastian,

thanks a lot for reporting this issue! This is a hard case, at least if we don't want to write a special rule for this single case. For the next version of HanTa I will definitely try to solve this problem.

Best

Christian

wartaal commented 2 years ago

Dear Bastian,

I think the problem was mainly caused by wrong annotations in the training data. In the new release, I hope the number of issues with this type of adjective is reduced. At least the forms you mentioned are now lemmatized correctly.