baseform: less word forms returned than defined in the resource

jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins

GNU Affero General Public License v3.0


baseform: less word forms returned than defined in the resource #31

Open nkrot opened 7 years ago

nkrot commented 7 years ago

Situation: The baseform resource de-lemma-utf8.txt defines several outcomes for a single input word, for example:

Zuschlage   Zuschlag
Zuschlage   zuschlagen

I would expect that all outcomes will be returned, as the correct baseform depends on the part of speech.

If the resource is used case-insensitively, the number of such collisions will increase, now comprising cases like:

Gefahren    Gefahr
gefahren    fahren
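To make the collision claim concrete, here is a minimal sketch that counts word forms mapping to more than one baseform, with and without case folding. It assumes the resource holds two whitespace-separated columns per line, as in the excerpts above; `find_collisions` is a hypothetical helper, not part of the plugin.

```python
from collections import defaultdict


def find_collisions(lines, case_insensitive=False):
    """Return {form: [baseforms]} for forms that map to more than one baseform.

    Assumption: each line is "<word form> <baseform>", separated by whitespace.
    """
    table = defaultdict(list)
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        form, base = parts
        # Case folding merges keys such as "Gefahren" and "gefahren".
        key = form.lower() if case_insensitive else form
        if base not in table[key]:
            table[key].append(base)
    return {form: bases for form, bases in table.items() if len(bases) > 1}


sample = [
    "Zuschlage   Zuschlag",
    "Zuschlage   zuschlagen",
    "Gefahren    Gefahr",
    "gefahren    fahren",
]
case_sensitive = find_collisions(sample)
case_folded = find_collisions(sample, case_insensitive=True)
```

On this sample, the case-sensitive pass reports one colliding form and the case-folded pass reports two.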

Would it be possible to fix the plugin to return all entries given in the resource?

Thanks

jprante commented 7 years ago

That's a bug: in the left column of de-lemma-utf8.txt, every word should occur at most once.

Part-of-speech is out of scope of the baseform token filter. For this, a wordnet-like input would be required with an NLP plugin (for POS tagging).

nkrot commented 7 years ago

Hopefully you agree that a single word form can map to one or more baseforms. This is the main idea of my initial post: if no PoS information is available, it is reasonable to assume any PoS and produce all possible baseforms. Here is an example of two different lemmata sharing the same derived form:

leaves       leaf
leaves       leave
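What "produce all possible baseforms" could look like, as a rough sketch: each token is replaced by every baseform listed for it, and all alternatives share the token's position, much like synonyms. The `expand` function and the in-memory `table` are hypothetical illustrations, not the plugin's API.

```python
def expand(tokens, table):
    """Return (position, term) pairs, emitting every known baseform per token.

    Tokens without an entry in the table pass through unchanged.
    """
    out = []
    for position, token in enumerate(tokens):
        for base in table.get(token, [token]):
            # All alternatives for one token keep the same position.
            out.append((position, base))
    return out


table = {"leaves": ["leaf", "leave"]}
result = expand(["the", "leaves"], table)
```

Here `result` holds both `leaf` and `leave` at position 1, so a query for either lemma can match.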

If the left column is supposed to contain unique words only, how should multiple outcomes be given? Like this?

Zuschlage     Zuschlag,zuschlagen

It is also possible to do this merging at load/compile time. That would make things a little easier for users who want to update the resource.
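A minimal sketch of such a load-time merge, assuming two whitespace-separated columns per input line and the comma-separated output format proposed above; `merge_entries` is a hypothetical helper, not existing plugin code.

```python
def merge_entries(lines):
    """Collapse repeated word forms into one line listing all baseforms.

    Input lines look like "Zuschlage   Zuschlag"; output lines use a tab
    between the form and a comma-separated list of its baseforms.
    """
    merged = {}  # dicts keep insertion order, so first-seen order is preserved
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        form, base = parts
        bases = merged.setdefault(form, [])
        if base not in bases:  # drop exact duplicate entries
            bases.append(base)
    return [f"{form}\t{','.join(bases)}" for form, bases in merged.items()]


merged_lines = merge_entries([
    "Zuschlage   Zuschlag",
    "Zuschlage   zuschlagen",
    "leaves      leaf",
    "leaves      leave",
    "Gefahren    Gefahr",
])
```

With this approach the resource file can keep one pair per line, which is easier to edit by hand, while the loader sees each word form exactly once.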