Improve language detection using UTF-8 character ranges

jprante / elasticsearch-langdetect

A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector

Apache License 2.0


Improve language detection using UTF-8 character ranges #9

Open gibrown opened 10 years ago

gibrown commented 10 years ago

A number of languages that this plugin does not currently support are actually very easy to detect based solely on the Unicode character ranges (as encoded in UTF-8) that those languages use.

Some examples are given in this gist: https://gist.github.com/gibrown/8652399#file-gistfile1-php-L28

Unfortunately, I only have anecdotal data on how well this works at the moment, but it's good enough that we run it in production.

jprante commented 10 years ago

Thanks for the hint about using Unicode ranges for language detection. Unfortunately, detecting single characters is not enough: the characters could be in random order and not form valid words of a language.

gibrown commented 10 years ago

I agree that single characters are not enough. But if X% of the text falls within a particular range, that is a pretty reliable way to detect languages that the plugin doesn't currently catch.
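
A minimal sketch of that threshold idea, in Python rather than the plugin's Java or the gist's PHP. This is not the code from the gist or the plugin: the function name `detect_by_range`, the `SCRIPT_RANGES` table, and the default threshold of 0.5 are all assumptions chosen for illustration. The code point ranges are standard Unicode blocks, and the language labels only make sense for scripts used mainly by a single language. It measures the share of letters in a range and does not check that they form valid words.

```python
# Hypothetical table: (language label, first code point, last code point).
# The ranges are Unicode blocks; mapping each block to one language is an
# assumption that only holds for scripts used mainly by a single language.
SCRIPT_RANGES = [
    ("ka", 0x10A0, 0x10FF),  # Georgian block
    ("hy", 0x0530, 0x058F),  # Armenian block
    ("km", 0x1780, 0x17FF),  # Khmer block
    ("si", 0x0D80, 0x0DFF),  # Sinhala block
]


def detect_by_range(text, threshold=0.5):
    """Return the label whose range covers at least `threshold` of the
    letters in `text`, or None if no range reaches that share."""
    # Ignore digits, punctuation, and whitespace so they do not dilute the ratio.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None

    counts = {}
    for ch in letters:
        cp = ord(ch)
        for label, first, last in SCRIPT_RANGES:
            if first <= cp <= last:
                counts[label] = counts.get(label, 0) + 1
                break

    if not counts:
        return None

    # Pick the best-covered range and apply the "X% of the text" cutoff.
    label, hits = max(counts.items(), key=lambda kv: kv[1])
    return label if hits / len(letters) >= threshold else None
```

Calling `detect_by_range("საქართველო")` returns `"ka"`, while purely Latin text returns `None`, which would leave the decision to the existing n-gram detector.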