polyfractal / elasticsearch-inquisitor

Site plugin for Elasticsearch to help understand and debug queries.
Apache License 2.0

Multi field with same name applies wrong analyzer #21

Closed jappievw closed 4 years ago

jappievw commented 11 years ago

When a multi-field property has a sub-field with the same name as the property itself, Inquisitor proposes field.field (here, brand.brand) for testing analyzers. However, Elasticsearch does not resolve that name to the sub-field's analyzer and falls back to the default analyzer.

Please consider this example:

curl -XPOST http://127.0.0.1:9200/test_multifield_same_name -d '{
    "mappings" : {
        "type" : {
            "properties" : {
                "brand" : {
                    "type" : "multi_field",
                    "fields" : {
                        "brand" : { "type" : "string", "analyzer" : "whitespace" },
                        "untouched" : { "type" : "string", "index" : "not_analyzed" }
                    }
                }
            }
        }
    }
}'

Inquisitor suggests testing the analyzer with the following request:

curl -XPOST "http://127.0.0.1:9200/test_multifield_same_name/_analyze?pretty=true&field=brand.brand" -d 'This Is Just A Test With Capitalized Words'

This causes the standard analyzer to be used instead of the whitespace analyzer: the tokens are lowercased and the stop words are dropped. The result is:

{
  "tokens" : [ {
    "token" : "just",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "<ALPHANUM>",
    "position" : 3
  }, {
    "token" : "test",
    "start_offset" : 15,
    "end_offset" : 19,
    "type" : "<ALPHANUM>",
    "position" : 5
  }, {
    "token" : "capitalized",
    "start_offset" : 25,
    "end_offset" : 36,
    "type" : "<ALPHANUM>",
    "position" : 7
  }, {
    "token" : "words",
    "start_offset" : 37,
    "end_offset" : 42,
    "type" : "<ALPHANUM>",
    "position" : 8
  } ]
}

To test the analyzer for this brand field, Elasticsearch expects the request to use the property name alone:

curl -XPOST "http://127.0.0.1:9200/test_multifield_same_name/_analyze?pretty=true&field=brand" -d 'This Is Just A Test With Capitalized Words'

This does apply the whitespace analyzer. The output is:

{
  "tokens" : [ {
    "token" : "This",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "Is",
    "start_offset" : 5,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  }, {
    "token" : "Just",
    "start_offset" : 8,
    "end_offset" : 12,
    "type" : "word",
    "position" : 3
  }, {
    "token" : "A",
    "start_offset" : 13,
    "end_offset" : 14,
    "type" : "word",
    "position" : 4
  }, {
    "token" : "Test",
    "start_offset" : 15,
    "end_offset" : 19,
    "type" : "word",
    "position" : 5
  }, {
    "token" : "With",
    "start_offset" : 20,
    "end_offset" : 24,
    "type" : "word",
    "position" : 6
  }, {
    "token" : "Capitalized",
    "start_offset" : 25,
    "end_offset" : 36,
    "type" : "word",
    "position" : 7
  }, {
    "token" : "Words",
    "start_offset" : 37,
    "end_offset" : 42,
    "type" : "word",
    "position" : 8
  } ]
}

The question is: when the property and the sub-field have exactly the same name, can Inquisitor suggest just the property name?
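
The rule being asked for could be sketched roughly as follows. This is only an illustration in shell, not Inquisitor's actual code; the function name analyze_field is hypothetical, and it assumes the property and sub-field names are already known from the mapping.

```shell
#!/bin/sh
# Hypothetical helper: pick the field name to pass to _analyze.
# $1 = multi_field property name, $2 = sub-field name.
analyze_field() {
    if [ "$1" = "$2" ]; then
        # The sub-field shares the property's name: use the bare
        # property name, which is what Elasticsearch resolves correctly.
        printf '%s\n' "$1"
    else
        # Any other sub-field keeps the property.sub-field form.
        printf '%s.%s\n' "$1" "$2"
    fi
}

analyze_field brand brand       # prints: brand
analyze_field brand untouched   # prints: brand.untouched

# Possible use with the request from this issue (needs a running node):
# curl -XPOST "http://127.0.0.1:9200/test_multifield_same_name/_analyze?pretty=true&field=$(analyze_field brand brand)" -d 'This Is Just A Test With Capitalized Words'
```

With this rule, the suggestion for the example mapping would be field=brand for the whitespace-analyzed field and field=brand.untouched for the not_analyzed one.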

polyfractal commented 11 years ago

Interesting, thanks for the fix. This actually looks like a bug in Elasticsearch itself. I'm going to open an issue over there to see if they can fix the underlying problem. Until then, your patch will work great.

Thanks!

jappievw commented 11 years ago

Yep, it's inconsistent behaviour. I think it's caused by the fact that the multi_field property type was added in a later release. A design goal was to make it fully backwards compatible, which they achieved by allowing brand instead of requiring brand.brand.