jprante / elasticsearch-plugin-bundle

A bundle of useful Elasticsearch plugins
GNU Affero General Public License v3.0

"langdetect" mapping issue language code not retrievable #24

Closed antonsar closed 7 years ago

antonsar commented 7 years ago

Hello,

I am trying this plugin out to handle documents with mixed languages. Unfortunately, the "langdetect" type is causing an issue for me.

Here is some info that may be useful:

ES version: 5.1.1
This bundle plugin version: 5.1.1.0
smart_cn analysis plugin: latest
kuromoji analysis plugin: latest

Then I did this (following the example):

curl -XDELETE 'localhost:9200/test'
curl -XPUT 'localhost:9200/test'
curl -XPOST 'localhost:9200/test/article/_mapping' -d '
{
  "article" : {
    "properties" : {
      "content" : { "type" : "langdetect" }
    }
  }
}
'
curl -XPUT 'localhost:9200/test/article/1' -d '
{
  "title" : "Some title",
  "content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}
'

Finally, I ran the search after calling refresh:

curl -XPOST 'localhost:9200/test/_search' -d '
{
  "query" : {
    "term" : { "content" : "en" }
  }
}
'

However, the search above returns 0 hits.

I double-checked the mapping, and "content" now shows up like this:

curl -XGET "localhost:9200/test/_mappings?pretty"

"content" : {
  "type" : "langdetect",
  "analyzer" : "_keyword",
  "include_in_all" : false
}

Calling curl -XGET 'localhost:9200/test/_search' shows this:

"_source" : {
  "content" : "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}

Based on the examples and the results I am getting, I don't think this is the intended behavior. How should I retrieve the detected language code?

Thank You!

antonsar commented 7 years ago

Hi JPrante,

Please let me know if you need further details. I would appreciate it if you could take a look at this issue.

Essentially, I followed the langdetect example from the README, and it did not return the correct result. The only factor that is different in my environment is that I have 2 additional plugins installed (smart_cn and kuromoji).

Thanks in advance!

jprante commented 7 years ago

For langdetect in 5.1.1.0, you have to explicitly declare all languages you want to be detected, like this:

PUT /test
{
   "mappings": {
      "article": {
         "properties": {
            "content": {
               "type": "langdetect",
               "languages" : [ "en", "de", "fr" ]
            }
         }
      }
   }
}
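
Combined with the steps from the original post, the whole sequence might look like the sketch below. It assumes Elasticsearch 5.1.1 with bundle 5.1.1.0 on localhost:9200; the ES_URL variable and the run_example function are illustrative names, not part of the plugin. Whether the term query then returns a hit is not verified here.

```shell
#!/bin/sh
# Sketch only. Assumptions: Elasticsearch 5.1.1 with bundle 5.1.1.0 is
# listening on localhost:9200; ES_URL and run_example are made-up names.
ES_URL="${ES_URL:-localhost:9200}"

# Index body with the explicit "languages" list from the reply above.
MAPPING='{
  "mappings": {
    "article": {
      "properties": {
        "content": {
          "type": "langdetect",
          "languages": [ "en", "de", "fr" ]
        }
      }
    }
  }
}'

# Sample document from the original post.
DOC='{
  "title": "Some title",
  "content": "Oh, say can you see by the dawns early light, What so proudly we hailed at the twilights last gleaming?"
}'

# Term query on the detected language code, as in the original post.
QUERY='{ "query": { "term": { "content": "en" } } }'

run_example() {
  curl -XDELETE "$ES_URL/test"                           # drop any old index
  curl -XPUT "$ES_URL/test" -d "$MAPPING"                # create index with the mapping
  curl -XPUT "$ES_URL/test/article/1" -d "$DOC"          # index the sample document
  curl -XPOST "$ES_URL/test/_refresh"                    # make the document searchable
  curl -XPOST "$ES_URL/test/_search?pretty" -d "$QUERY"  # search by language code
}

# Only contact the cluster when explicitly asked to.
if [ "${1:-}" = "run" ]; then
  run_example
fi
```

Saved under a hypothetical name such as example.sh, this would be started with: sh example.sh run. Setting the mapping while creating the index, rather than through a separate _mapping call, mirrors the PUT /test request above.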

Bundle 5.1.1.0 is a preview release, not official.

ICU and hyphen are working and documented; all other analyzers are not reviewed and not well documented. The docs and examples are out of sync. Bugs and changes are to be expected. They will be fixed and documented in future versions.

antonsar commented 7 years ago

Thank you so much for the clarification!!!

Thanks again