wartaal / HanTa

The Hanover Tagger - A simple approach to lemmatization and POS-tagging of German morphology based on heuristics and hidden Markov models
GNU Lesser General Public License v3.0

Wrong lemmatization of 'zoom' #5

Open OxyMal opened 2 years ago

OxyMal commented 2 years ago

Hi, there's a problem with the lemmatization of the word "Zoom", which becomes "zoo".

wartaal commented 2 years ago

That is strange. The word is not in the training data (I guess), but it shouldn't remove the last letter. I will have a look and try to fix it for a future version (I guess by the end of the summer there will be a new version with a bundle of small improvements).

OxyMal commented 2 years ago

Thank you for your quick reply. If possible, could you also check the lemmatization of "kannst", which becomes "kannen", and "Buche", which becomes "Buch"? Thanks a lot in advance!

wartaal commented 2 years ago

Buche is a problem, because it could indeed be the dative of Buch. Zoom is an interesting one. The word Zoom is not in the training data, but Zoo is. Moreover, the model learned from the training data that 'm' is a valid noun suffix, found in words like 'Unsichtbarem', which is annotated as a noun with the lemma 'Unsichtbare'. So, to solve the problem with Zoom, I have to find a better solution for nominalized adjectives and annotate them better in the training data. Nice work for the summer vacation!
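
To make the explanation above concrete, here is a minimal toy sketch of how a known stem plus a learned suffix can produce the wrong lemma. It is not HanTa's actual model (which is statistical); the names `KNOWN_STEMS`, `NOUN_SUFFIXES` and `toy_lemma` are hypothetical and exist only for this illustration.

```python
# Toy illustration only: this is NOT HanTa's real algorithm.
# Both sets are hypothetical stand-ins for what a model might learn.
KNOWN_STEMS = {"zoo", "unsichtbare", "buch"}
NOUN_SUFFIXES = {"", "m", "e", "s", "n"}


def toy_lemma(word):
    """Return the longest known stem whose remainder is a learned suffix."""
    w = word.lower()
    for cut in range(len(w), 0, -1):
        stem, suffix = w[:cut], w[cut:]
        if stem in KNOWN_STEMS and suffix in NOUN_SUFFIXES:
            return stem
    return w  # no analysis found: leave the word unchanged


print(toy_lemma("Unsichtbarem"))  # the annotated case the suffix 'm' comes from
print(toy_lemma("Zoom"))          # same pattern, but here it is an error
print(toy_lemma("Buche"))         # ambiguous: dative of Buch, or the tree
```

Because "zoo" is known and "m" is an accepted suffix, the toy has no reason to prefer the unseen word "zoom" over the analysis "zoo" + "m".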

OxyMal commented 2 years ago

"Buche" is feminine and its plural is "Buchen", whereas "Buch" is neuter and its plural is "Bücher". They also have different forms in the accusative ("Buche" vs. "Buch") and in the dative ("Buche" vs. "Buch"), etc. Maybe it's possible to add them with their correct forms for better differentiation. Nominalized adjectives could be difficult, it's true. Maybe it would help to add more weight when such adjectives are used together with articles or demonstrative pronouns like "diese", "dieser", ... Just an idea :) But yeah, thanks a lot! And have a good vacation!

wartaal commented 1 year ago

In the latest version the analysis of 'Zoom' is still wrong, but at least it gets a correct lemma. The problem with 'kannst' is solved. 'Buche' is still a problem. At least in context, like in "die Buche", this should work correctly. Adding more training data and gender features to nouns in the training data might solve this problem.
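
As a rough illustration of why context and gender features could help with 'Buche', here is a toy sketch that picks a lemma based on the preceding article. It is an assumption-laden simplification, not how HanTa works: the tables `CANDIDATES` and `ARTICLE_GENDERS` and the function `lemma_in_context` are hypothetical, and the article table ignores plural readings and most case forms.

```python
# Toy illustration only; all names and tables are hypothetical.
# Candidate lemmas with grammatical gender, in "default" order.
CANDIDATES = {"buche": [("Buch", "n"), ("Buche", "f")]}

# Genders an article is compatible with (heavily simplified: no plural).
ARTICLE_GENDERS = {"die": {"f"}, "der": {"m", "f"}, "dem": {"m", "n"}}


def lemma_in_context(previous_word, word):
    """Prefer the candidate whose gender agrees with the preceding article."""
    options = CANDIDATES.get(word.lower(), [(word, None)])
    genders = ARTICLE_GENDERS.get(previous_word.lower()) if previous_word else None
    if genders:
        for lemma, gender in options:
            if gender in genders:
                return lemma
    return options[0][0]  # no usable context: fall back to the default


print(lemma_in_context("die", "Buche"))  # feminine article
print(lemma_in_context("dem", "Buche"))  # old-fashioned dative of "Buch"
print(lemma_in_context(None, "Buche"))   # no context at all
```

Without an article the toy falls back to its default candidate, which mirrors the situation described above: the isolated word stays ambiguous.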

leduvu commented 1 year ago

The German word "Lehrplan" also gets strange lemmatization results:

Lehrplan [('lehrpla', 'NN'), ('n', 'SUF_NN')]
Lehrpläne [('lehrpläne', 'NE')]
Lehrplans [('lehrplan', 'NE'), ('s', 'SUF_NE')]

wartaal commented 1 year ago

Thanks for testing. I will think about a solution, but this doesn't seem to be easy. There are no rules in the program; everything is learned from the training data. So I cannot just fix a rule to handle this.

Obviously, 'Lehrplan' is not in the training data and HanTa cannot analyse it as a compound, because there is no noun 'Lehr' (without an e at the end).

Anyway, keep reporting such cases! Some day I will find a solution.
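
The compound problem can be sketched with a toy splitter. This is only an illustration under stated assumptions, not HanTa's analysis: `KNOWN_NOUNS` is a hypothetical lexicon and `split_compound` deliberately ignores linking elements and dropped final vowels, which is exactly the gap described above.

```python
# Toy illustration only; the lexicon and function are hypothetical.
KNOWN_NOUNS = {"lehre", "plan", "haus", "tür"}


def split_compound(word):
    """Split a word into two known nouns, or return None if that fails."""
    w = word.lower()
    for i in range(1, len(w)):
        head, tail = w[:i], w[i:]
        if head in KNOWN_NOUNS and tail in KNOWN_NOUNS:
            return (head, tail)
    return None


print(split_compound("Haustür"))   # both parts are known nouns
print(split_compound("Lehrplan"))  # "lehr" is not a noun, only "lehre" is
```

Since the first part of "Lehrplan" is "Lehr" rather than "Lehre", a splitter that only accepts known nouns finds no compound analysis and the word is treated as unknown.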

H4rryK4ne commented 1 year ago

I have a similar issue with the sentence ["alte", "Firmware"] -> [('alte', 'alt', 'ADJ(A)'), ('Firmware', 'Firmwar', 'NN')]. Firmware loses its e.

Interestingly, it does not lose the e if it stands alone: ["Firmware"] -> [('Firmware', 'Firmware', 'NE')]

wartaal commented 1 year ago

Oh, that is an interesting one! Actually, I have no idea how to treat unknown loanwords, or how to recognize them in the first place. However, the algorithm should be less eager in finding suffixes for unknown words. That might indeed be an issue to spend some time on.
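
One way to picture "less eager" suffix finding is the toy contrast below. It uses a hard rule where a statistical tagger would presumably use a probability or penalty, so it is a sketch of the idea rather than a proposal for HanTa itself; `SUFFIXES`, `KNOWN_STEMS`, `eager_lemma` and `conservative_lemma` are all hypothetical.

```python
# Toy illustration only; suffix list, lexicon and functions are hypothetical.
SUFFIXES = ("en", "e", "n", "s")
KNOWN_STEMS = {"plan", "buch"}


def eager_lemma(word):
    """Strip the first matching suffix, even if the stem is unknown."""
    w = word.lower()
    for suffix in SUFFIXES:
        if w.endswith(suffix) and len(w) > len(suffix):
            return w[: -len(suffix)]
    return w


def conservative_lemma(word):
    """Strip a suffix only when the remaining stem is already known."""
    w = word.lower()
    for suffix in SUFFIXES:
        stem = w[: -len(suffix)]
        if w.endswith(suffix) and stem in KNOWN_STEMS:
            return stem
    return w  # unknown stem: keep the word whole


for word in ("Firmware", "Lehrplan", "Plans"):
    print(word, eager_lemma(word), conservative_lemma(word))
```

The eager variant reproduces the truncated forms reported in this thread ('firmwar', 'lehrpla'), while the conservative variant leaves unknown words intact and still strips the suffix from "Plans".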

joprice commented 11 months ago

Another similar one I hit is Stuntman -> Stuntma

wartaal commented 11 months ago

Thanks! Always good to have some cases to work on ;-)

bizrockman commented 4 months ago

Well, in that case: I'm having a similar issue ;-)

print(tagger.analyze('Edelstein', taglevel=3))
('edelstei', [('edelstei', 'NN'), ('n', 'SUF_NN')], 'NN')

wartaal commented 4 months ago

Thanks! This last one could be solved by annotating adj-noun compounds appropriately in the training data.