Open OxyMal opened 2 years ago
That is strange. The word is not in the training data (I guess), but it shouldn't remove the last letter. I will have a look and try to fix it for a future version (I guess by the end of the summer, there will be a new version with a bundle of small improvements).
Thank you for your quick reply. If possible, could you also check the lemmatization of "kannst", which becomes "kannen", and "Buche", which becomes "Buch"? Thanks a lot in advance!
Buche is a problem, because it could indeed be the dative of Buch. Zoom is an interesting one. The word Zoom is not in the training data, but Zoo is. Moreover, it learned from the training data that 'm' is a valid noun suffix, found in words like 'Unsichtbarem', which is annotated as a noun with the lemma 'Unsichtbare'. So, to solve the problem with Zoom, I have to find a better solution for nominalized adjectives and annotate them better in the training data. Nice work for the summer vacation!
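To make the explanation above concrete, here is a toy sketch of how a known stem plus a learned suffix can swallow the last letter of an unknown word. It is not HanTa's actual model: the stem list, the suffix list and the function `naive_analyze` are all hypothetical and exist only for illustration.

```python
# Toy illustration only, NOT HanTa's real algorithm.
# Hypothetical "training data": stems the model has seen ...
KNOWN_STEMS = {"zoo", "buch", "unsichtbare"}
# ... and noun suffixes it has learned (e.g. 'm' from 'Unsichtbarem').
NOUN_SUFFIXES = {"m", "e", "s"}


def naive_analyze(word):
    """Return (stem, suffix) using greedy suffix stripping."""
    w = word.lower()
    if w in KNOWN_STEMS:
        return (w, "")
    # Try longer suffixes first; accept a split only if the stem is known.
    for suffix in sorted(NOUN_SUFFIXES, key=len, reverse=True):
        stem = w[: -len(suffix)]
        if w.endswith(suffix) and stem in KNOWN_STEMS:
            return (stem, suffix)
    # Unknown word with no plausible split: leave it untouched.
    return (w, "")


print(naive_analyze("Zoom"))          # unknown word, split into known stem + suffix
print(naive_analyze("Unsichtbarem"))  # the case the suffix was learned from
print(naive_analyze("Buche"))         # ambiguous: tree, or dative of 'Buch'
```

In this toy, 'Zoom' is split into 'zoo' + 'm' simply because both parts are known, which mirrors the behaviour described above. Without gender or context information, 'Buche' is split into 'buch' + 'e' the same way.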
"Buche" is feminine, and its plural is "Buchen". "Buch" is neuter, and its plural is "Bücher". They also have different forms in the accusative ("Buche" vs. "Buch"), in the dative ("Buche" vs. "Buch"), etc. Maybe it's possible to add them in their correct forms for better differentiation. Nominalized adjectives could be difficult, it's true. Maybe it would help to add more weight when such adjectives are used together with articles or demonstrative pronouns like "diese", "dieser", ... Just an idea :) But yeah, thanks a lot! And have a good vacation!
In the latest version the analysis of 'Zoom' is still wrong, but at least it gets a correct lemma. The problem with 'kannst' is solved. 'Buche' is still a problem. At least in context, as in "die Buche", this should work correctly. Adding more training data and gender features to nouns in the training data might solve this problem.
The German word "Lehrplan" also gets strange lemmatization results:
Lehrplan [('lehrpla', 'NN'), ('n', 'SUF_NN')]
Lehrpläne [('lehrpläne', 'NE')]
Lehrplans [('lehrplan', 'NE'), ('s', 'SUF_NE')]
Thanks for testing. I will think about a solution, but this doesn't seem to be easy. There are no rules in the program; everything is learned from the training data. So I cannot just fix a rule to handle this.
Obviously, 'Lehrplan' is not in the training data and HanTa cannot analyse it as a compound, because there is no noun 'Lehr' (without an e at the end).
Anyway, keep reporting such cases! Some day I will find a solution.
I have a similar issue with the sentence ["alte", "Firmware"] -> [('alte', 'alt', 'ADJ(A)'), ('Firmware', 'Firmwar', 'NN')]
Firmware loses its e.
Interestingly, it does not lose the e if it stands alone: ["Firmware"] -> [('Firmware', 'Firmware', 'NE')]
Oh, that is an interesting one! Actually, I have no idea how to treat unknown loanwords, or how to recognize them in the first place. However, the algorithm should be less eager to find suffixes in unknown words. That might indeed be an issue worth spending some time on.
Another similar one I hit is Stuntman -> Stuntma
Thanks! Always good to have some cases to work on ;-)
Well, in that case, I'm having a similar issue ;-)
print(tagger.analyze('Edelstein', taglevel=3))
('edelstei', [('edelstei', 'NN'), ('n', 'SUF_NN')], 'NN')
Thanks! This last one could be solved by annotating adj-noun compounds appropriately in the training data.
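Since the cases reported in this thread keep accumulating, a small regression list might help when checking a new version. The following is only a sketch: the expected lemmas are my assumption of the intended results, `find_failures` is a hypothetical helper, and `reported_output` is a stand-in that replays the wrong lemmas quoted in this thread rather than calling HanTa.

```python
from typing import Callable, Dict, List, Tuple

# Words reported in this thread, mapped to the lemma one would expect
# (assumed targets, not taken from HanTa's training data).
EXPECTED: Dict[str, str] = {
    "Zoom": "Zoom",
    "kannst": "können",
    "Buche": "Buche",
    "Lehrplan": "Lehrplan",
    "Firmware": "Firmware",
    "Stuntman": "Stuntman",
    "Edelstein": "Edelstein",
}

# Stand-in lemmatizer: replays the wrong lemmas quoted in this thread.
REPORTED: Dict[str, str] = {
    "Zoom": "zoo",
    "kannst": "kannen",
    "Buche": "Buch",
    "Lehrplan": "lehrpla",
    "Firmware": "Firmwar",
    "Stuntman": "Stuntma",
    "Edelstein": "edelstei",
}


def reported_output(word: str) -> str:
    return REPORTED[word]


def find_failures(
    lemmatize: Callable[[str], str], cases: Dict[str, str]
) -> List[Tuple[str, str, str]]:
    """Return (word, expected, got) for every mismatch, ignoring case."""
    failures = []
    for word, expected in cases.items():
        got = lemmatize(word)
        if got.lower() != expected.lower():
            failures.append((word, expected, got))
    return failures


for word, expected, got in find_failures(reported_output, EXPECTED):
    print(f"{word}: expected {expected!r}, got {got!r}")
```

To run it against a real tagger, replace `reported_output` with any callable that returns the lemma for a single word. The comparison ignores case because the taglevel=3 output quoted above is lowercased.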
Hi, there's a problem with the lemmatization of the word "Zoom", which becomes "zoo".