Open Nelrohd opened 10 years ago
It is pretty hard to detect the language of short text. For instance, Google Translate also detects your text as Dutch:
http://translate.google.com/#auto/en/je%20vend%20ma%20chemise%20verte
Generally I've found that anything shorter than 300 bytes (about 60 words) does not seem very reliable. I unfortunately haven't gathered any statistical data to find a good cutoff.
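A minimal sketch of how a caller might apply that rule of thumb before trusting a detection result. The 300-byte threshold is only the informal figure mentioned above, not a measured cutoff, and the function name `is_long_enough` is hypothetical:

```python
# Rule-of-thumb guard: treat detection on very short input as unreliable.
# MIN_RELIABLE_BYTES is the informal figure from this thread, not a
# statistically derived cutoff.
MIN_RELIABLE_BYTES = 300


def is_long_enough(text: str, min_bytes: int = MIN_RELIABLE_BYTES) -> bool:
    """Return True if the UTF-8 encoded text has at least min_bytes bytes."""
    return len(text.encode("utf-8")) >= min_bytes


if __name__ == "__main__":
    sample = "je vend ma chemise verte"
    # The sample sentence is far below the threshold.
    print(len(sample.encode("utf-8")), is_long_enough(sample))
```

Measuring in bytes rather than characters follows the wording above; for scripts with multi-byte characters a character or word count may be a better fit.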
Hi,
Do you intend to support the short-text profiles for this purpose? (Distributed since 03/03/2014 https://code.google.com/p/language-detection/)
That looks promising. The training data is based on a Twitter corpus.
1.4.0.1 released, with the setting "profile": "/langdetect/short-text/"
Thanks for the quick answer and patch! It would be awesome if the "short-text" profile setting could be reachable from the REST API as well :)
1.4.0.2 was just released; it has another REST API command for switching profiles.
Thanks a lot, a quick review already shows good improvements, such as:
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
  "profile" : "/langdetect/short-text/",
  "languages" : [ {
    "language" : "fr",
    "probability" : 0.5714283159213042
  }, {
    "language" : "nl",
    "probability" : 0.42857000187571836
  } ]
}
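With the short-text profile the two candidates are fairly close (about 0.57 vs. 0.43), so a client may want to check the winning probability before acting on it. The following is a rough sketch, assuming the response shape shown above; the function name `best_language` and the `min_probability` parameter are hypothetical and not part of the plugin:

```python
import json

# Response body copied from the short-text profile example in this thread.
RESPONSE = """
{
  "profile" : "/langdetect/short-text/",
  "languages" : [ {
    "language" : "fr",
    "probability" : 0.5714283159213042
  }, {
    "language" : "nl",
    "probability" : 0.42857000187571836
  } ]
}
"""


def best_language(response_text, min_probability=0.5):
    """Return the most probable language code, or None if the response
    has no candidates or the best one is below min_probability."""
    data = json.loads(response_text)
    candidates = data.get("languages", [])
    if not candidates:
        return None
    top = max(candidates, key=lambda c: c["probability"])
    if top["probability"] < min_probability:
        return None
    return top["language"]


if __name__ == "__main__":
    print(best_language(RESPONSE))
    print(best_language(RESPONSE, min_probability=0.9))
```

A stricter threshold trades coverage for confidence: with `min_probability=0.9` this particular response would be rejected as too uncertain.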
Hi,
I get some strange results when I use it on French text:
curl -XPOST 'localhost:9200/_langdetect?pretty' -d 'je vend ma chemise verte'
{
  "ok" : true,
  "languages" : [ {
    "language" : "nl",
    "probability" : 0.9999951375010268
  } ]
}
The text is French, but I get "nl". Is something wrong?