jprante / elasticsearch-langdetect

A plugin for language detection in Elasticsearch using Nakatani Shuyo's language detector
Apache License 2.0

Accuracy problem #10

Open Nelrohd opened 10 years ago

Nelrohd commented 10 years ago

Hi,

I get some strange results when I use it on French text:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
    "ok" : true,
    "languages" : [ {
        "language" : "nl",
        "probability" : 0.9999951375010268
    } ]
}

The text is French, but I get "nl". Is something wrong?

gibrown commented 10 years ago

Detecting the language of short text is pretty hard. For instance, Google Translate also detects your text as Dutch:

http://translate.google.com/#auto/en/je%20vend%20ma%20chemise%20verte

Generally I've found that anything shorter than 300 bytes (about 60 words) does not seem very reliable. I unfortunately haven't gathered any statistical data to find a good cutoff.
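
A minimal sketch of the length heuristic above, in Python. The 300-byte figure is only the rule of thumb from this comment, not a measured cutoff. The function name `long_enough_for_detection` and the choice of UTF-8 for counting bytes are assumptions made for illustration; neither is part of the plugin.

```python
def long_enough_for_detection(text: str, min_bytes: int = 300) -> bool:
    """Return True if the UTF-8 encoding of `text` is at least `min_bytes` long.

    Hypothetical helper: the 300-byte default is the informal threshold
    mentioned in this thread, not a value defined by the plugin.
    """
    return len(text.encode("utf-8")) >= min_bytes


sample = "je vend ma chemise verte"
# Print the byte length and whether the sample passes the length check.
print(len(sample.encode("utf-8")), long_enough_for_detection(sample))
```

A caller could use a check like this to skip detection, or to treat the result as low-confidence, for inputs below the threshold.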

adrienschuler-zz commented 9 years ago

Hi,

Do you intend to support the short-text profiles for this purpose? (Distributed since 03/03/2014: https://code.google.com/p/language-detection/)

gibrown commented 9 years ago

That looks promising. The training data is based on a Twitter corpus.

jprante commented 9 years ago

1.4.0.1 released, with the setting "profile": "/langdetect/short-text/"

adrienschuler-zz commented 9 years ago

Thanks for the quick answer and patch! It would be awesome if the "short-text" profile setting could be reachable from the REST API as well :)

jprante commented 9 years ago

1.4.0.2 has just been released; it has another REST API command for switching profiles.

adrienschuler-zz commented 9 years ago

Thanks a lot, a quick review already shows good improvements, such as:

curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
    "profile" : "/langdetect/short-text/",
    "languages" : [ {
        "language" : "fr",
        "probability" : 0.5714283159213042
    }, {
        "language" : "nl",
        "probability" : 0.42857000187571836
    } ]
}
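
With two candidates this close in probability, a client may not want to blindly accept the first entry. The following is a minimal sketch, in Python, of reading a response shaped like the one above and accepting the top language only if its probability clears a threshold. The function name `top_language` and the 0.9 default are assumptions for illustration, not something the plugin provides.

```python
import json
from typing import Optional


def top_language(response_body: str, min_probability: float = 0.9) -> Optional[str]:
    """Return the most probable language code, or None if it is not confident enough.

    Hypothetical helper: it assumes a JSON body with a "languages" list of
    {"language", "probability"} objects, as shown in this thread.
    """
    data = json.loads(response_body)
    languages = data.get("languages", [])
    if not languages:
        return None
    best = max(languages, key=lambda entry: entry["probability"])
    if best["probability"] < min_probability:
        return None
    return best["language"]


short_text_response = """
{
    "profile" : "/langdetect/short-text/",
    "languages" : [ {
        "language" : "fr",
        "probability" : 0.5714283159213042
    }, {
        "language" : "nl",
        "probability" : 0.42857000187571836
    } ]
}
"""

print(top_language(short_text_response))        # None (0.57 is below 0.9)
print(top_language(short_text_response, 0.5))   # fr
```

Note that a threshold alone would not have caught the original report: the default profile returned "nl" with a probability above 0.99999, so a high probability on short text is not proof of a correct detection.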