jprante / elasticsearch-analysis-decompound

Decompounding Plugin for Elasticsearch
GNU General Public License v2.0

Highlighting seems to be broken #16

Closed Ragmaanir closed 8 years ago

Ragmaanir commented 8 years ago

Hi,

I tried this on 1.5.2 and 1.7.2. The following script should reproduce the error (note: I'm using port 9400 locally):

curl -XDELETE http://localhost:9400/xyz/
curl -XPUT http://localhost:9400/xyz/ -d '
index:
  analysis:
    analyzer:
      search_analyzer:
        type: "custom"
        tokenizer: "standard"
        filter:
          - lowercase
          - x_compound
      index_analyzer:
        type: "custom"
        tokenizer: "standard"
        filter:
          - lowercase
          - x_compound
    filter:
      x_compound:
        type: "decompound"
'

curl -XPUT http://localhost:9400/xyz/_mapping/entries -d '
{
  "properties": {
    "title": {
      "type": "string",
      "search_analyzer": "search_analyzer",
      "analyzer": "index_analyzer"
    }
  }
}'

curl -XPOST http://localhost:9400/xyz/entries -d '
{"title": "dies ist ein test"}
'
curl -XPOST http://localhost:9400/xyz/entries -d '
{"title": "dies ist ein testbeitrag"}
'

curl -XPOST http://localhost:9400/xyz/entries -d '
{"title": "dies ist ein titeltest"}
'

curl -XGET 'http://localhost:9400/xyz/_search?pretty' -d '
{
  "fields": ["title"],
  "query": {
    "multi_match": {
      "fields": ["title"],
      "query": "test",
      "analyzer": "search_analyzer"
    }
  },
  "size": 10,
  "highlight": {
    "number_of_fragments": 1,
    "fields": {
      "title": {"number_of_fragments": 1}
    }
  }
}'

The result returned by the query is:

{
  "took" : 4,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "failed" : 0
  },
  "hits" : {
    "total" : 3,
    "max_score" : 0.7123179,
    "hits" : [ {
      "_index" : "xyz",
      "_type" : "entries",
      "_id" : "AVEBt3eMGZQfGkXb9v9D",
      "_score" : 0.7123179,
      "fields" : {
        "title" : [ "dies ist ein test" ]
      },
      "highlight" : {
        "title" : [ "dies ist ein<em>dies ist ein test</em>" ]
      }
    }, {
      "_index" : "xyz",
      "_type" : "entries",
      "_id" : "AVEBt3ezGZQfGkXb9v9E",
      "_score" : 0.5036848,
      "fields" : {
        "title" : [ "dies ist ein testbeitrag" ]
      },
      "highlight" : {
        "title" : [ "dies ist ein<em>dies</em> testbeitrag" ]
      }
    }, {
      "_index" : "xyz",
      "_type" : "entries",
      "_id" : "AVEBt3fmGZQfGkXb9v9F",
      "_score" : 0.5036848,
      "fields" : {
        "title" : [ "dies ist ein titeltest" ]
      },
      "highlight" : {
        "title" : [ "dies ist ein<em>st e</em> titeltest" ]
      }
    } ]
  }
}

As you can see there are two problems with the highlights:

  1. The matched word is not highlighted: [ "dies ist ein<em>st e</em> titeltest" ]
  2. Parts of the sentence are duplicated: [ "dies ist ein<em>dies ist ein test</em>" ]

Maybe I have to change the mapping to use different word positions?

Thanks

jprante commented 8 years ago

Yes, highlighting is broken. When words are split, the positions of the subwords come out wrong in the index.

Patches welcome!
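
For illustration, here is a minimal sketch of how wrong token offsets can produce exactly the output reported above. It is a simplified model of an offset-based highlighter, not the plugin's or Elasticsearch's actual code. The function name `highlight` and the `start:end[:hit]` argument format are made up for this example, and the bad offsets (6:10 and 0:17) are assumptions inferred from the reported highlights, since the issue does not show the real token stream.

```shell
#!/usr/bin/env bash
# Simplified, hypothetical model of offset-based highlighting.
# Usage: highlight TEXT START:END[:hit] ...
# Tokens are processed in stream order; a token marked "hit" matched the query.
highlight() {
  local text=$1; shift
  local out="" last=0 span start end hit piece
  for span in "$@"; do
    IFS=: read -r start end hit <<< "$span"
    # Copy the text between the previous token and this one, if any.
    if (( start > last )); then
      out+=${text:last:start-last}
    fi
    # Emit whatever the offsets point at, even if they point backwards.
    piece=${text:start:end-start}
    if [[ $hit == hit ]]; then
      out+="<em>${piece}</em>"
    else
      out+=$piece
    fi
    # The cursor never moves backwards.
    if (( end > last )); then
      last=$end
    fi
  done
  # Append the rest of the text after the last token.
  out+=${text:last}
  printf '%s\n' "$out"
}

# Assumed bad offsets for the subword "test" (6:10 instead of 18:22):
highlight 'dies ist ein titeltest' 0:4 5:8 9:12 6:10:hit
# prints: dies ist ein<em>st e</em> titeltest

# Offsets that point at the subword inside the compound:
highlight 'dies ist ein titeltest' 0:4 5:8 9:12 18:22:hit
# prints: dies ist ein titel<em>test</em>
```

In this model, a matched token whose offsets lie before the text already emitted gets wrapped and appended out of place, which would explain both the highlighted wrong substring and the duplicated sentence parts.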

Ragmaanir commented 8 years ago

I think #12 fixes this.

jprante commented 8 years ago

Pushed out 1.7.1.3 with the fix. Thanks for the pointer, I had forgotten to merge that pull request.

Ragmaanir commented 8 years ago

Thanks :)