I have skimmed through the code of the decompounder plugin and noticed that, in addition to the decompounding itself, it generates the baseform of the last word. While this is good per se, the implementation of baseform generation inside the decompounder differs from that of the separate baseform plugin: in the decompounder it is a heuristic algorithm (Patricia trie?), and in the baseform plugin it is a mere list-based mapping.
Would it be possible to unify the approach to baseform generation? I suggest combining both approaches into a single algorithm:
try the mapping-based approach
and if it fails, use heuristics (Patricia trie)
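The two steps above could be sketched roughly as follows. This is only an illustration, not the plugins' actual code: `MAPPING`, `heuristic_baseform`, and the suffix rules are hypothetical stand-ins, and a real implementation would consult the structure the decompounder already uses instead of the toy suffix stripper.

```python
from typing import Callable, Mapping, Optional

# Hypothetical mapping data; a real plugin would load this from its resource file.
MAPPING = {"Häuser": "Haus", "Kinder": "Kind"}


def heuristic_baseform(word: str) -> Optional[str]:
    """Toy placeholder for the heuristic step: strip a few common suffixes."""
    for suffix in ("en", "er", "e", "n", "s"):
        # Keep at least three characters of stem to avoid absurd results.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return None


def baseform(
    word: str,
    mapping: Mapping[str, str] = MAPPING,
    fallback: Callable[[str], Optional[str]] = heuristic_baseform,
) -> str:
    """Try the mapping first; if it has no entry, fall back to the heuristic."""
    hit = mapping.get(word)
    if hit is not None:
        return hit
    guess = fallback(word)
    # If neither step produces anything, return the word unchanged.
    return guess if guess is not None else word
```

With this ordering, the curated mapping always wins where it has an entry, and the heuristic only covers words the mapping does not know.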
There are a couple of issues that need to be addressed in the combined approach. Namely:
The general baseform generator handles any part of speech, while the decompounder needs to handle nouns only (or mostly nouns, as people may also want to decompound adjectives like computergesteuert). Two mappings could therefore be made available: one for words coming from the decompounder and one for all other words. The general baseform generator would use both resources, the decompounder only the first.
The general baseform generator is currently case-sensitive. The mapping contains entries in their correct dictionary case. However, when a word comes from the decompounder, its letter case can differ from the dictionary case. Therefore, baseform generation inside the decompounder should rather be case-insensitive.
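To make the two points concrete, here is a small sketch of how the split resources and the case-insensitive lookup might fit together. All names and sample entries (`NOUN_MAPPING`, `GENERAL_MAPPING`, `decompounder_baseform`, `general_baseform`) are assumptions made up for illustration; they do not come from either plugin.

```python
from typing import Dict, Optional

# Hypothetical resources: one for words coming from the decompounder,
# one for all other words.
NOUN_MAPPING: Dict[str, str] = {"Häuser": "Haus", "Bücher": "Buch"}
GENERAL_MAPPING: Dict[str, str] = {"ging": "gehen", "besser": "gut"}


def casefold_index(mapping: Dict[str, str]) -> Dict[str, str]:
    """Build a lookup table keyed by the case-folded form of each entry."""
    return {key.casefold(): value for key, value in mapping.items()}


# Built once, so each decompounder lookup is a plain dictionary access.
NOUN_INDEX = casefold_index(NOUN_MAPPING)


def decompounder_baseform(word: str) -> Optional[str]:
    """Case-insensitive lookup that uses only the decompounder resource."""
    return NOUN_INDEX.get(word.casefold())


def general_baseform(word: str) -> Optional[str]:
    """Case-sensitive lookup that consults both resources."""
    if word in GENERAL_MAPPING:
        return GENERAL_MAPPING[word]
    return NOUN_MAPPING.get(word)
```

One thing to keep in mind with such an index: entries that differ only in letter case collapse into a single key, so the decompounder resource should not rely on case to distinguish words.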
Does it make sense?