Using short phrases leads to erroneous language detection

rfdiaz / language-detection

Automatically exported from code.google.com/p/language-detection

Using short phrases leads to erroneous language detection #12

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. Use the phrase "distribution agreement"

I expect it to return English, but it returns French.

I am using version 02-02-2011 on Windows 7.

This seems to happen with certain short phrases: the wrong language is returned. 
Is there anything I can do to fix this?

Original issue reported on code.google.com by greg.geo...@gmail.com on 31 Mar 2011 at 8:50

GoogleCodeExporter commented 9 years ago

There was another thread which covered short text detection. I guess this 
language ID is better suited to text longer than 100 to 150 characters. 
With anything shorter, it may struggle and confuse the text with other languages. In the 
summer (when I have a break from uni) I will focus on finding a good solution 
for short text (10 to 150 characters). For the time being, how often do you 
expect your text to be as short as two words?

Original comment by mawa...@live.com on 31 Mar 2011 at 9:52

GoogleCodeExporter commented 9 years ago

Well, actually, I use this solution for language detection on queries entered 
in our search engine. I was using Google's language detection API, but I 
found out that its terms of service do not allow use in an enterprise-based 
solution, and your IP is blocked after 1000 uses per day (our search 
engine gets more hits). So I would say the average search phrase is 
four words, maybe 15-25 characters.

Original comment by greg.geo...@gmail.com on 1 Apr 2011 at 1:17

GoogleCodeExporter commented 9 years ago

Short text language detection may work better using word features (such as 
stop words, character set, special words, and so on). I am planning to work on 
this in the summer during my break. For the time being, have you thought of 
combining several queries and then feeding them into the language ID? Or do you not 
anticipate this use case? (You would expect several terms to be used in any one 
search engine session.)

Original comment by mawa...@live.com on 1 Apr 2011 at 6:56
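
The suggestion above of combining several queries from one session could be sketched roughly as follows. This is only an illustration in Python, not part of langdetect: `SessionQueryBuffer`, `min_chars`, and `max_queries` are hypothetical names, and the default of 100 characters is taken from the rough 100 to 150 character estimate given earlier in this thread.

```python
from collections import deque


class SessionQueryBuffer:
    """Collects recent queries from one search session (hypothetical helper)."""

    def __init__(self, min_chars=100, max_queries=10):
        # min_chars: assumed minimum text length before detection is attempted.
        # max_queries: only the most recent queries are kept.
        self.min_chars = min_chars
        self.queries = deque(maxlen=max_queries)

    def add(self, query):
        """Store a query, ignoring empty or whitespace-only input."""
        query = query.strip()
        if query:
            self.queries.append(query)

    def combined_text(self):
        """Join the buffered queries into one text for the detector."""
        return " ".join(self.queries)

    def ready(self):
        """True once the combined text reaches the assumed minimum length."""
        return len(self.combined_text()) >= self.min_chars
```

A caller would add each query as it arrives and pass `combined_text()` to whatever detector is in use only once `ready()` returns true. This assumes that all queries in a session are in the same language, which may not hold for every user.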

GoogleCodeExporter commented 9 years ago

Detection for short texts was already discussed in Issue 8 ( 
http://code.google.com/p/language-detection/issues/detail?id=8 ).
The current langdetect model is not good at short text detection... I'll 
announce it in langdetect's wiki.

Original comment by nakatani.shuyo on 4 Apr 2011 at 7:53

GoogleCodeExporter commented 9 years ago

Is there anything new on short text analysis? I would like to replace my 
very heuristic solution 
(https://github.com/karussell/Jetwick/blob/master/src/main/java/de/jetwick/tw/TweetDetector.java) 
with your more mature approach. I'm doing a very simple 
analysis based on common noise words of every language (hand-collected DE and 
EN, plus Google translations into 5 other languages). One major problem is that the 
tokenization is based on whitespace... but I like the simplicity, as one can 
add languages very easily ;)

Original comment by tableYou...@gmail.com on 25 Nov 2012 at 12:02
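
For readers unfamiliar with the noise-word approach described in the last comment, a minimal sketch might look like the following. It is written in Python for brevity and is not the code from the linked TweetDetector.java; the word lists are tiny illustrative samples, and `STOP_WORDS` and `detect_by_stop_words` are hypothetical names.

```python
# Tiny illustrative stop-word lists; a real list would be much larger.
STOP_WORDS = {
    "en": {"the", "and", "of", "to", "is", "in", "for", "with"},
    "de": {"der", "die", "das", "und", "ist", "nicht", "mit", "für"},
    "fr": {"le", "la", "les", "et", "est", "des", "pour", "avec"},
}


def detect_by_stop_words(text, stop_words=STOP_WORDS):
    """Return the language whose stop words match most tokens, or None."""
    # Whitespace tokenization, as in the comment above: punctuation stays
    # attached to words, so "the," would not match "the".
    tokens = text.lower().split()
    scores = {
        lang: sum(1 for token in tokens if token in words)
        for lang, words in stop_words.items()
    }
    best = max(scores, key=scores.get)
    top = scores[best]
    # No match at all, or a tie between languages: decline to guess.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return best
```

Adding a language means adding one more entry to the dictionary. Note that a phrase such as "distribution agreement" contains no stop words, so this sketch returns `None` for it rather than a wrong answer; whether declining to answer is acceptable depends on the application.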