jprante / elasticsearch-analysis-decompound

Decompounding Plugin for Elasticsearch
GNU General Public License v2.0

Failure to decompound Wandhalter #38

Closed: felixbarny closed this issue 8 years ago

felixbarny commented 8 years ago

The term Wandhalterung is split into the tokens wand and alterung instead of wand and halterung. When I set the threshold to 0.63 or higher, the tokens are wandh and alterung. What can I do to fix this?

These are my settings:

index :
    analysis :
        analyzer :
            analyzer_decomp :
                type : custom
                tokenizer : standard
                filter : [lowercase, decomp]
        filter :
            decomp:
                type: decompound
        tokenizer:
            decomp:
                type: standard
                filter:
                  - decomp

I'm using Elasticsearch 2.1.1 and elasticsearch-analysis-decompound 2.1.1.0.

felixbarny commented 8 years ago

Interestingly, this problem only appears when the word has the -ung suffix. When analyzing Wandhalter, everything works as expected.

felixbarny commented 8 years ago

I kind of solved this by applying a Snowball stemmer both before and after the decomp filter:

index :
    analysis :
        analyzer :
            analyzer_decomp :
                type : custom
                tokenizer : standard
                filter : [lowercase, snow_de, decomp, snow_de]
        filter :
            decomp:
                type: decompound
            snow_de :
                type : snowball
                language : German2
        tokenizer:
            decomp:
                type: standard
                filter:
                  - decomp
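
To check what either analyzer chain actually emits, the tokens can be inspected through Elasticsearch's _analyze endpoint. The snippet below is only a rough sketch for building the request URLs. The index name test and the host http://localhost:9200 are assumptions that do not come from the settings above, and the helper analyze_url is hypothetical. It compares the failing word with the one that already works.

```python
from urllib.parse import urlencode


def analyze_url(text, index="test", analyzer="analyzer_decomp",
                host="http://localhost:9200"):
    """Build an _analyze URL for one word.

    The index name and host are assumptions; adjust them to the cluster
    where the settings above were applied.
    """
    query = urlencode({"analyzer": analyzer, "text": text})
    return f"{host}/{index}/_analyze?{query}"


# Print one URL per word so the token output can be compared side by side.
for word in ("Wandhalterung", "Wandhalter"):
    print(analyze_url(word))
```

Opening each printed URL with curl or a browser should list the tokens produced for that word, which makes it easy to see whether the stemmer changes the split.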