Open gibrown opened 10 years ago
Thanks for the hint to use Unicode ranges for language detection. Unfortunately, detecting single characters is not enough: the characters could be in random order and not form valid words of a language.
I agree that single characters are not enough, but if X% of the text falls within a particular range, then this is a pretty reliable way to detect languages that the plugin doesn't currently catch.
There are a number of languages that are not currently supported by this plugin but are actually very easy to detect just based on the Unicode character ranges that those languages use.
Some examples are given in this gist: https://gist.github.com/gibrown/8652399#file-gistfile1-php-L28
Unfortunately, I only have anecdotal data on how well this works at the moment, but it's good enough that we run it in production.
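To make the "X% of the text is from a particular range" idea concrete, here is a minimal Python sketch. It is not the code from the gist or from the plugin. The `SCRIPT_RANGES` table, the `detect_by_range` name, and the 0.5 default threshold are all assumptions chosen for illustration. The one-script-to-one-language mapping is also a simplification (the Hebrew block, for example, is used for Yiddish as well).

```python
# Hypothetical table: language code -> Unicode block ranges (inclusive).
# The block boundaries are the standard Unicode blocks; the mapping from
# a block to a single language is an illustrative assumption.
SCRIPT_RANGES = {
    "he": [(0x0590, 0x05FF)],  # Hebrew block
    "th": [(0x0E00, 0x0E7F)],  # Thai block
    "ka": [(0x10A0, 0x10FF)],  # Georgian block
    "hy": [(0x0530, 0x058F)],  # Armenian block
}


def detect_by_range(text, threshold=0.5):
    """Return a language code if at least `threshold` of the letters in
    `text` fall inside that language's ranges, otherwise None."""
    # Only count letters, so spaces, digits, and punctuation do not
    # dilute the ratio.
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return None
    for lang, ranges in SCRIPT_RANGES.items():
        hits = sum(
            1
            for ch in letters
            if any(lo <= ord(ch) <= hi for lo, hi in ranges)
        )
        if hits / len(letters) >= threshold:
            return lang
    return None


if __name__ == "__main__":
    print(detect_by_range("שלום עולם"))
    print(detect_by_range("hello world"))
```

A check like this would only run as a fallback for scripts the statistical detector does not cover, and the threshold would need tuning on real data.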