jprante / elasticsearch-analysis-decompound

Decompounding Plugin for Elasticsearch
GNU General Public License v2.0

Update DecompoundTokenFilter.java #12

Closed elastic-martin closed 8 years ago

elastic-martin commented 9 years ago

I still had some problems with highlighting, so I examined how the Elasticsearch compound word token filter determines offsets: every part gets the offsets of the original compound word (see below). As a result, when a match on a part of a compound word is highlighted, the whole compound word is highlighted. That is fine for me. If the analysis-decompound plugin behaved the same way, this would solve issue #6 as well.

PUT http://localhost:9200/test/

{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": [ "myTokenFilter" ]
                }
            },
            "filter": {
                "myTokenFilter": {
                    "type": "dictionary_decompounder",
                    "word_list": [ "foot", "ball" ]
                }
            }
        }
    }
}

GET http://localhost:9200/test/_analyze?analyzer=my_analyzer&text=football

{
    "tokens": [
        {
            "token": "football",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "foot",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "ball",
            "start_offset": 0,
            "end_offset": 8,
            "type": "<ALPHANUM>",
            "position": 1
        }
    ]
}
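
To make the offset behaviour above concrete, here is a minimal, self-contained Java sketch. It is not the plugin's code and does not use the Lucene API: `CompoundOffsetSketch`, `Token`, `decompound` and `highlight` are hypothetical names, and the plain substring match stands in for the real dictionary lookup. It only illustrates that every part inherits the offsets and position of the original compound word, so highlighting a match on a part marks the whole word.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class CompoundOffsetSketch {

    /** Hypothetical token holder; not a Lucene or plugin class. */
    public static final class Token {
        public final String term;
        public final int startOffset;
        public final int endOffset;
        public final int position;

        Token(String term, int startOffset, int endOffset, int position) {
            this.term = term;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.position = position;
        }

        @Override
        public String toString() {
            return term + " [" + startOffset + "," + endOffset + "] position=" + position;
        }
    }

    /**
     * Emits the original word followed by every dictionary entry it contains.
     * Assumption: a simple substring test replaces the real dictionary matching.
     * Each part copies the offsets and position of the whole compound word.
     */
    public static List<Token> decompound(String word, int startOffset, int position,
                                         List<String> dictionary) {
        int endOffset = startOffset + word.length();
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token(word, startOffset, endOffset, position));
        String lower = word.toLowerCase(Locale.ROOT);
        for (String part : dictionary) {
            if (lower.contains(part)) {
                // Same offsets and position as the compound, not the part's own span.
                tokens.add(new Token(part, startOffset, endOffset, position));
            }
        }
        return tokens;
    }

    /** Wraps the span covered by the matched token's offsets in <em> tags. */
    public static String highlight(String text, Token match) {
        return text.substring(0, match.startOffset)
                + "<em>" + text.substring(match.startOffset, match.endOffset) + "</em>"
                + text.substring(match.endOffset);
    }

    public static void main(String[] args) {
        List<Token> tokens = decompound("football", 0, 1, Arrays.asList("foot", "ball"));
        for (Token t : tokens) {
            System.out.println(t);
        }
        // A match on the part "foot" still highlights the whole compound word.
        System.out.println(highlight("football", tokens.get(1)));
    }
}
```

With this scheme a query for "foot" matches the document, and the highlighter, which only sees the offsets 0 to 8, wraps all of "football".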

jprante commented 9 years ago

Included with commit 07b3100864a3dd36cdda90b9460d318458120a89