jprante / elasticsearch-langdetect

A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector
Apache License 2.0
251 stars, 46 forks

Search language not found #75

Open Bongsakorn opened 6 years ago

Bongsakorn commented 6 years ago

When I call the language detection REST API directly, it returns the language I expect. But when I search the mapped index by language, one of the documents is not found.

First, I create the index with this mapping:

PUT /language_detection_2
{
   "mappings": {
      "stream": {
         "properties": {
            "text": {
               "type": "text",
               "fields": {
                  "language": {
                     "type": "langdetect",
                     "languages": [
                        "ja",
                        "en",
                        "th",
                        "ko"
                     ],
                     "store": true
                  }
               }
            }
         }
      }
   }
}

Then I index the data:

PUT language_detection_2/stream/2
{
 "text": "มีความขี้เกียจระดับ 10 วันจันทร์มันก็จะประมาณนี้แหละ 😅 @ True Tower https:\/\/www.instagram.com\/p\/BbtPqZllnPmtfX2MRmUhYT-"
}

PUT language_detection_2/stream/3
{
 "text": "khaohom01ทุกคนเก่งมากคะ❤#mtutd bambam_boobiiผลเท่าไหร่จ้าน้องข้าวหอม😊 khaohom01@bambam_boobii ชนะ1-0ค่ะ😃😃"
}

PUT language_detection_2/stream/4
{
 "text": "นุ้งหมี วิ่งเร็วอะถ่ายไม่ทัน 5555 #narubadin #nw13 #toyotaleaguecup #brutd #mtutd 11.10.60 @ i-mobile Stadium"
}

Then I search for Thai documents:

GET language_detection_2/_search
{
  "query": {
    "match": {
      "text.language": "th"
    }
  }
}

I get only 2 of the 3 documents (document 4 is missing):

"hits": {
    "total": 2,
    "max_score": 0.6931472,
    "hits": [
      {
        "_index": "language_detection_2",
        "_type": "stream",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
          "text": "มีความขี้เกียจระดับ 10 วันจันทร์มันก็จะประมาณนี้แหละ 😅 @ True Tower https://www.instagram.com/p/BbtPqZllnPmtfX2MRmUhYT-"
        }
      },
      {
        "_index": "language_detection_2",
        "_type": "stream",
        "_id": "3",
        "_score": 0.2876821,
        "_source": {
          "text": "khaohom01ทุกคนเก่งมากคะ❤#mtutd bambam_boobiiผลเท่าไหร่จ้าน้องข้าวหอม😊 khaohom01@bambam_boobii ชนะ1-0ค่ะ😃😃"
        }
      }
    ]
  }

Is this a bug, or am I doing something wrong? How does the text.language field store the detected languages? Can I display this field?

mdahamiwal commented 6 years ago

I'm not sure about the inner implementation, but the third document's text is being detected as "en" at index time, even though the language detection endpoint favors "th" for the same text:

GET _langdetect
{
  "text": "นุ้งหมี วิ่งเร็วอะถ่ายไม่ทัน 5555 #narubadin #nw13 #toyotaleaguecup #brutd #mtutd 11.10.60 @ i-mobile Stadium"
}

{
  "languages": [
    {
      "language": "th",
      "probability": 0.4285714155915268
    },
    {
      "language": "ro",
      "probability": 0.428569318044014
    },
    {
      "language": "en",
      "probability": 0.14285807944062645
    }
  ]
}
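
Note that "th" and "ro" are almost tied in this response (both about 0.43), so the result for this text is not a clear-cut "th". To see which language was actually indexed for each document, one option is to ask the search API for the stored sub-field. The following is only a sketch: it assumes the plugin stores the detected codes under text.language because the mapping above sets "store": true, and it has not been checked against this plugin version. Send it as the body of GET language_detection_2/_search:

```json
{
  "query": {
    "match_all": {}
  },
  "stored_fields": [
    "text.language"
  ]
}
```

If the assumption holds, each hit should carry a "fields" entry listing the stored language code(s) for that document, which would show whether document 4 was indexed as "th" or as something else.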