nkrot opened this issue 7 years ago (status: Open)
That's a bug: in the left column of de-lemma-utf8.txt, every word should occur at most once.
Part of speech is out of scope for the baseform token filter. Handling it would require a WordNet-like input together with an NLP plugin (for POS tagging).
Hopefully you agree that a single word form can map to one or more baseforms. That is the main idea of my initial post: if no PoS information is available, it is reasonable to assume any PoS and produce all possible baseforms. Here is an example of two different lemmata that share the same derived form:
leaves leaf
leaves leave
If the left column is supposed to contain unique words only, how will multiple outcomes be given? Like this:
Zuschlage Zuschlag,zuschlagen
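A minimal sketch of how such a merged line could be read. It assumes (not confirmed by the plugin) that the two columns are separated by whitespace and the alternative baseforms by commas. The function name `parse_merged_line` is hypothetical.

```python
def parse_merged_line(line):
    """Split 'word base1,base2' into (word, [base1, base2]).

    Assumption: columns are whitespace-separated and alternative
    baseforms are comma-separated; the real resource may differ.
    """
    parts = line.split(None, 1)  # split on the first run of whitespace only
    if len(parts) != 2:
        raise ValueError(f"expected 'word baseform[,baseform...]', got {line!r}")
    word, bases = parts
    # Drop empty fragments such as those left by a trailing comma.
    return word, [b.strip() for b in bases.split(",") if b.strip()]


print(parse_merged_line("Zuschlage Zuschlag,zuschlagen"))
# prints ('Zuschlage', ['Zuschlag', 'zuschlagen'])
```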
It is also possible to do this merging at load/compile time, which makes things a little easier for users who want to update the resource.
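Roughly, load-time merging could look like the sketch below. It is not the plugin's actual loader: the function name `load_baseforms` is hypothetical, and the two-column, whitespace-separated row format is an assumption.

```python
from collections import defaultdict


def load_baseforms(lines):
    """Build a word -> list-of-baseforms table, merging repeated words.

    Assumption: each row is 'word baseform', separated by whitespace.
    """
    table = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed rows
        word, base = parts
        if base not in table[word]:  # keep first-seen order, drop exact repeats
            table[word].append(base)
    return dict(table)


table = load_baseforms(["leaves leaf", "leaves leave"])
print(table["leaves"])  # prints ['leaf', 'leave']
```

With this approach the resource file can keep one pair per row, and a lookup returns every baseform recorded for the word.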
Situation: The baseform resource de-lemma-utf8.txt defines various outcomes for one input word, for example:
I would expect all outcomes to be returned, as the correct baseform depends on the part of speech.
If the resource is used case-insensitively, the number of such collisions will increase, now comprising cases like:
Would it be possible to fix the plugin to return all entries given in the resource?
Thanx