jprante / elasticsearch-langdetect

A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector
Apache License 2.0

problem with decoding escaped unicode string #60

Closed lexand closed 7 years ago

lexand commented 7 years ago

Hi. My setup is:

$ curl -XGET http://127.0.0.1:9200
{
  "name" : "mYc2RK-",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "WCHUEzGyR8yTPfvtJWTBFQ",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}

Langdetect plugin version: 5.3.0.1

Example 1

GET _langdetect
{  "text" : "какой-то не очень длинный русский текст"}

{
  "languages": [
    {
      "language": "ru",
      "probability": 0.999997235732777
    }
  ]
}

Example 2

GET _langdetect
{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"}

{
  "languages": [
    {
      "language": "hr",
      "probability": 0.9999997870025434
    }
  ]
}

Both texts are identical, but the first is sent as is and the second is Unicode-escaped. The language was detected correctly only in the first example.

The escaped Unicode strings come from the ES PHP library v5.1.3 (elasticsearch/elasticsearch/src/Elasticsearch/Serializers/SmartSerializer.php:40).
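To illustrate why the two request bodies should be equivalent, here is a minimal Python sketch. It is not the PHP client's code; it assumes only that the client escapes non-ASCII characters as \uXXXX, which Python's json.dumps also does by default.

```python
import json

text = "какой-то не очень длинный русский текст"

# ensure_ascii=True (Python's default) escapes non-ASCII characters as \uXXXX,
# which resembles the body sent in Example 2.
escaped_body = json.dumps({"text": text}, ensure_ascii=True)

# ensure_ascii=False keeps the Cyrillic characters as they are, like Example 1.
raw_body = json.dumps({"text": text}, ensure_ascii=False)

# The two bodies differ as character sequences...
assert escaped_body != raw_body

# ...but any conforming JSON parser decodes both to the same string.
assert json.loads(escaped_body) == json.loads(raw_body) == {"text": text}

print(escaped_body)
print(raw_body)
```

Since both forms decode to the same value, a server that parses the body as JSON should not be able to tell them apart.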

Another example with an escaped Unicode string suggests that the problem is probably in the langdetect plugin. Create a new doc with a Unicode-escaped string:

POST test/test
{
  "Title":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442"
}

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

GET test/test/AVszuHOlZhOlEAMN9jBe

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszuHOlZhOlEAMN9jBe",
  "_version": 1,
  "found": true,
  "_source": {
    "Title": "какой-то не очень длинный русский текст"
  }
}

Create a new doc with an unescaped Unicode string:

POST test/test
{"Title": "какой-то не очень длинный русский текст"}

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszvDhMZhOlEAMN9jBh",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "created": true
}

GET test/test/AVszvDhMZhOlEAMN9jBh

{
  "_index": "test",
  "_type": "test",
  "_id": "AVszvDhMZhOlEAMN9jBh",
  "_version": 1,
  "found": true,
  "_source": {
    "Title": "какой-то не очень длинный русский текст"
  }
}

As you can see, both texts were stored and displayed correctly.

jprante commented 7 years ago

There was an error in the REST action. As of 5.3.0.2, the request body is treated as JSON and parsed as JSON:

POST /_langdetect
{
  "text": "..."
}
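
The thread does not say how the plugin handled the body before 5.3.0.2, so the following Python sketch is only an illustration of why parsing matters, not a description of the plugin's code. The helper count_cyrillic is hypothetical. The sketch compares what a consumer sees if the escapes are never decoded with what it sees after JSON parsing.

```python
import json

# Request body with \uXXXX escapes, kept literal by the raw string prefix.
body = r'{"text":"\u043a\u0430\u043a\u043e\u0439-\u0442\u043e"}'


def count_cyrillic(s):
    """Count characters in the basic Cyrillic block (U+0400..U+04FF)."""
    return sum(1 for ch in s if "\u0400" <= ch <= "\u04ff")


# Assumption for illustration: the body is handled as opaque text, so the
# escapes stay as ASCII sequences and no Cyrillic characters are visible.
print(count_cyrillic(body))  # 0

# Parsed as JSON, the escapes are decoded into the actual characters.
parsed_text = json.loads(body)["text"]
print(parsed_text)  # какой-то
print(count_cyrillic(parsed_text))  # 7
```

A detector fed the undecoded form would see only Latin letters, digits and backslashes, which may explain a wrong guess for the escaped request.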

lexand commented 7 years ago

thanks a lot