I still had some problems with highlighting.
I examined how the Elasticsearch compound word token filter determines offsets: every part gets the offsets of the original compound word (see below).
As a result, when a match on one part of a compound word is highlighted, the whole compound word is highlighted. This is OK for me. If analysis-decompounder behaved the same way, this would solve issue #6 as well.
PUT http://localhost:9200/test/
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [ "myTokenFilter" ]
        }
      },
      "filter": {
        "myTokenFilter": {
          "type": "dictionary_decompounder",
          "word_list": [ "foot", "ball" ]
        }
      }
    }
  }
}
GET http://localhost:9200/test/_analyze?analyzer=my_analyzer&text=football:
{
}
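To make the offset behaviour described above concrete, here is a minimal Python sketch. It is only a simplified model of what I observed, not the actual Lucene/Elasticsearch code: the `Token` type and the `decompound` and `highlight` helpers are hypothetical names made up for this illustration. It shows why a match on "ball" ends up highlighting all of "football" when every part carries the offsets of the original compound word.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    # Hypothetical token type for this sketch: term plus character offsets.
    term: str
    start_offset: int
    end_offset: int


def decompound(token: Token, word_list: List[str]) -> List[Token]:
    """Emit the original token, then every dictionary word found inside it.

    Each part reuses the offsets of the whole compound word, which is the
    behaviour described above (a simplified model, not the real filter).
    """
    tokens = [token]
    lowered = token.term.lower()
    for start in range(len(lowered)):
        for end in range(start + 1, len(lowered) + 1):
            part = lowered[start:end]
            if part in word_list:
                tokens.append(Token(part, token.start_offset, token.end_offset))
    return tokens


def highlight(text: str, tokens: List[Token], query: str) -> str:
    """Wrap the offset span of the first token matching the query in <em>."""
    for tok in tokens:
        if tok.term == query:
            return (text[:tok.start_offset] + "<em>"
                    + text[tok.start_offset:tok.end_offset] + "</em>"
                    + text[tok.end_offset:])
    return text


text = "football"
tokens = decompound(Token("football", 0, 8), ["foot", "ball"])
print(tokens)
print(highlight(text, tokens, "ball"))  # prints "<em>football</em>"
```

If the parts carried their own offsets instead (4 to 8 for "ball"), the same `highlight` helper would mark only the matching part.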