jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins
GNU Affero General Public License v3.0

generating baseform inside decompounder vs. standalone baseform. redundant #32

Open nkrot opened 7 years ago

nkrot commented 7 years ago

I have skimmed through the code of the decompounder plugin and noticed that, in addition to doing the decompounding itself, it generates the baseform of the last word. While this is good per se, the implementation of baseform generation inside the decompounder differs from that of the separate baseform plugin: in the decompounder it is a heuristic algorithm (Patricia trie?), whereas in the baseform plugin it is a mere list-based mapping.

Would it be possible to unify the approach to baseform generation? I suggest combining both approaches into a single algorithm:

  1. try the mapping-based approach
  2. and if it fails, use heuristics (Patricia trie)
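
To make the proposed order concrete, here is a minimal Java sketch. It is not taken from the plugin's code: the class name `CombinedBaseform`, the `mapping` table, and the `heuristic` function are hypothetical stand-ins for the baseform plugin's list and the decompounder's trie-based logic.

```java
import java.util.Map;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical sketch: mapping lookup first, heuristic fallback second. */
final class CombinedBaseform {
    // Word form -> baseform, standing in for the list-based mapping.
    private final Map<String, String> mapping;
    // Stand-in for the decompounder's heuristic; empty means "no guess".
    private final Function<String, Optional<String>> heuristic;

    CombinedBaseform(Map<String, String> mapping,
                     Function<String, Optional<String>> heuristic) {
        this.mapping = mapping;
        this.heuristic = heuristic;
    }

    String baseform(String word) {
        // Step 1: try the mapping-based approach.
        String mapped = mapping.get(word);
        if (mapped != null) {
            return mapped;
        }
        // Step 2: fall back to the heuristic; keep the word unchanged if that fails too.
        return heuristic.apply(word).orElse(word);
    }
}
```

Returning the input unchanged when both steps fail is an assumption of this sketch; the combined implementation could just as well signal "unknown" to the caller.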

There are a couple of issues that need to be addressed in the combined approach. Namely:

  1. the general baseform generator handles any part of speech, while the decompounder needs to handle nouns only (or mostly nouns, as people may want to decompound adjectives like computergesteuert as well). That said, two mappings could be made available: one for words coming from the decompounder and the other for all other words. The general baseform generator should use both resources, while the decompounder should use only one.

  2. the general baseform generator is currently case-sensitive. The mapping contains entries given in their correct dictionary case. However, when a word comes from the decompounder, its letter case may differ from the dictionary form. Therefore, baseform generation inside the decompounder should rather be case-insensitive.
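
Both points could be sketched roughly as follows. Everything here is hypothetical (`BaseformResources`, `forDecompounder`, `forGeneral`, and the two maps are names made up for illustration), and the sketch assumes that lowercasing the keys of the decompounder mapping once at load time is acceptable.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch of two mappings with different case handling. */
final class BaseformResources {
    // Entries in dictionary case, looked up case-sensitively.
    private final Map<String, String> general;
    // Entries for compound parts, keyed by lowercased word form.
    private final Map<String, String> compoundLower = new HashMap<>();

    BaseformResources(Map<String, String> general, Map<String, String> compoundWords) {
        this.general = general;
        // Assumption: if two keys collide after lowercasing, the first one seen wins.
        compoundWords.forEach((form, base) ->
            compoundLower.putIfAbsent(form.toLowerCase(Locale.ROOT), base));
    }

    /** Decompounder path: one resource only, case-insensitive. */
    Optional<String> forDecompounder(String part) {
        return Optional.ofNullable(compoundLower.get(part.toLowerCase(Locale.ROOT)));
    }

    /** General path: both resources, exact-case mapping first. */
    Optional<String> forGeneral(String word) {
        String hit = general.get(word);
        return hit != null ? Optional.of(hit) : forDecompounder(word);
    }
}
```

With this layout a fragment such as `kinder` coming out of the decompounder still finds the entry stored as `Kinder`, while the general generator keeps its exact-case behaviour for its own mapping. Whether the general path should also consult the second resource case-insensitively, as it does here, is an open design choice.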

Does it make sense?