Closed lexand closed 7 years ago
Hi. My config is
$ curl -XGET http://127.0.0.1:9200
{
  "name" : "mYc2RK-",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "WCHUEzGyR8yTPfvtJWTBFQ",
  "version" : {
    "number" : "5.3.0",
    "build_hash" : "3adb13b",
    "build_date" : "2017-03-23T03:31:50.652Z",
    "build_snapshot" : false,
    "lucene_version" : "6.4.1"
  },
  "tagline" : "You Know, for Search"
}
Lang detect 5.3.0.1
Example 1
GET _langdetect
{ "text" : "какой-то не очень длинный русский текст" }

{ "languages": [ { "language": "ru", "probability": 0.999997235732777 } ] }
Example 2
GET _langdetect
{ "text": "\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442" }

{ "languages": [ { "language": "hr", "probability": 0.9999997870025434 } ] }
Both texts are identical, but the first is sent as is and the second is Unicode-escaped. Only in the first example was the language detected correctly.
The escaped Unicode strings are produced by the ES PHP library v5.1.3 (elasticsearch/elasticsearch/src/Elasticsearch/Serializers/SmartSerializer.php:40).
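To illustrate why the two request bodies should be equivalent, here is a minimal Python sketch (not the PHP client's code). It assumes the PHP serializer escapes non-ASCII characters the way `json_encode` does without `JSON_UNESCAPED_UNICODE`; Python's `json.dumps` with its default `ensure_ascii=True` behaves similarly and stands in for it here.

```python
import json

text = "какой-то не очень длинный русский текст"

# ensure_ascii=True (the default) writes every non-ASCII character as \uXXXX.
# This stands in for the PHP client's output; that mapping is an assumption.
escaped_body = json.dumps({"text": text})
# ensure_ascii=False keeps the Cyrillic characters as they are.
raw_body = json.dumps({"text": text}, ensure_ascii=False)

# The two bodies differ as character sequences ...
assert escaped_body != raw_body
# ... but any JSON parser decodes them to the same value.
assert json.loads(escaped_body) == json.loads(raw_body) == {"text": text}

# The escaped body is pure ASCII, so code that inspects it without JSON
# decoding sees only Latin letters, digits and backslashes.
print(escaped_body.isascii())
```

If a server handed the escaped body to a language detector without decoding it as JSON first, the detector would never see Cyrillic at all, which would be consistent with the wrong "hr" result in Example 2.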
Here is another example with an escaped Unicode string, which suggests that the problem is in the langdetect plugin rather than in Elasticsearch itself. Create a new doc with a Unicode-escaped string:
POST test/test
{ "Title": "\u043a\u0430\u043a\u043e\u0439-\u0442\u043e \u043d\u0435 \u043e\u0447\u0435\u043d\u044c \u0434\u043b\u0438\u043d\u043d\u044b\u0439 \u0440\u0443\u0441\u0441\u043a\u0438\u0439 \u0442\u0435\u043a\u0441\u0442" }

{ "_index": "test", "_type": "test", "_id": "AVszuHOlZhOlEAMN9jBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true }

GET test/test/AVszuHOlZhOlEAMN9jBe

{ "_index": "test", "_type": "test", "_id": "AVszuHOlZhOlEAMN9jBe", "_version": 1, "found": true, "_source": { "Title": "какой-то не очень длинный русский текст" } }
Create a new doc with an unescaped Unicode string:
POST test/test
{ "Title": "какой-то не очень длинный русский текст" }

{ "_index": "test", "_type": "test", "_id": "AVszuHOlZhOlEAMN9jBe", "_version": 1, "result": "created", "_shards": { "total": 2, "successful": 1, "failed": 0 }, "created": true }

GET lookmytrips/Image/AVszvDhMZhOlEAMN9jBh

{ "_index": "test", "_type": "test", "_id": "AVszvDhMZhOlEAMN9jBh", "_version": 1, "found": true, "_source": { "Title": "какой-то не очень длинный русский текст" } }
As you can see, both texts were stored and displayed correctly.
There was an error in the REST action. As of 5.3.0.2, the request body is treated as JSON and parsed as JSON:
POST /_langdetect { "text": "..." }
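For anyone re-testing after upgrading, the request above could be built from a script roughly like this. It is only a sketch: the helper name `build_langdetect_request` is hypothetical, the base URL is taken from the curl call in the report, and the `Content-Type` header is an assumption, not something the plugin is confirmed to require.

```python
import json
import urllib.request


def build_langdetect_request(text, base_url="http://127.0.0.1:9200"):
    """Build (but do not send) a POST /_langdetect request.

    Hypothetical helper for illustration; base_url defaults to the
    address used in the original report.
    """
    # json.dumps escapes non-ASCII characters as \uXXXX by default, so this
    # body has the same shape as the one the PHP client sends.
    body = json.dumps({"text": text}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url}/_langdetect",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


req = build_langdetect_request("какой-то не очень длинный русский текст")
# To actually send it against a running node:
# response = urllib.request.urlopen(req).read()
```

Since the body is Unicode-escaped here, a fixed plugin should report the same language as for the unescaped text in Example 1.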
thanks a lot