English detected as af

GoogleCodeExporter commented 9 years ago

What steps will reproduce the problem?
1. I pass the text "viking river cruise" to detect its language.

What is the expected output? What do you see instead?
Expected English, but it returns "af".

What version of the product are you using? On what operating system?
latest version

Please provide any additional information below.

Original issue reported on code.google.com by srssreej...@gmail.com on 17 Feb 2011 at 4:26

GoogleCodeExporter commented 9 years ago

Though I am not an expert in Afrikaans, I just extracted this text from the 
Afrikaans Wikipedia: "Dit het vir die eerste keer..". I am guessing that 
because you entered a short text, the tool didn't have enough characters to 
make an intelligent judgment. The character sequences in your text are similar 
to the Afrikaans n-gram profile. Hence the "af" outcome.

You would get a better result if you topped this up with a few more words (in my 
view, around 200 characters would give you better accuracy).
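
To make the point about short input concrete, here is a minimal sketch of character n-gram counting. It is not the library's own profile code; the function name `char_ngrams` and the choice of n-grams up to three characters are assumptions made for illustration.

```python
from collections import Counter


def char_ngrams(text, max_n=3):
    """Count character n-grams (n = 1..max_n) in text (hypothetical helper)."""
    # Pad with spaces so n-grams at word boundaries are also counted.
    padded = " " + text.lower() + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


short = char_ngrams("viking river cruise")
# The total is small, so only a little evidence is available per language.
print(sum(short.values()), "n-grams in total")
print(short.most_common(5))
```

A three-word input yields only a few dozen n-grams, so a handful of sequences that happen to be frequent in another language's profile can decide the outcome.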

Original comment by mawa...@live.com on 17 Feb 2011 at 9:32

GoogleCodeExporter commented 9 years ago

Thanks for the comment and response.
As the second comment says, langdetect is not good at detecting the language of very short text.
Could you try a longer text?

Original comment by nakatani.shuyo on 18 Feb 2011 at 3:14

GoogleCodeExporter commented 9 years ago

Thank you for the quick response.
It works well for long text, but I am trying to detect the language of tweets.
Issues I faced while using language-detection with tweets:
1) If a tweet contains URL links, then even if it is in English it is detected 
as some other language.

So I removed the URL links from the tweets and tried again. That is why the text 
became too short.
I split my sample text, "Viking river cruise", into words and checked each one 
with lang-detect.
The results are as follows:
viking - af
river - en
cruise - fr

So I assume that lang-detect works with short text too, but the result may 
differ once the probabilities are taken into account.
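
The URL-removal step described above could be sketched roughly as follows. The regular expression is a simplified assumption, not what language-detection itself uses, and `strip_urls` is a hypothetical helper name.

```python
import re

# Simplified pattern (assumption): matches http(s) links and bare www. links.
URL_RE = re.compile(r"https?://\S+|www\.\S+")


def strip_urls(tweet):
    """Remove URLs from a tweet and collapse the leftover whitespace."""
    without_urls = URL_RE.sub(" ", tweet)
    return " ".join(without_urls.split())


cleaned = strip_urls("Viking river cruise http://example.com/deal now")
print(cleaned, "->", len(cleaned), "characters left")
```

Printing the remaining length makes it easy to see when a cleaned tweet has become too short for a reliable guess.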

Original comment by srssreej...@gmail.com on 18 Feb 2011 at 5:10

GoogleCodeExporter commented 9 years ago

My thoughts are:
- "viking" on its own could well be "af", "river" "en", and "cruise" "fr". 
However, combining them might not always give you English. Moreover, 
some other short English, Spanish, etc. words might give you a different 
language.

- In my view, so far, the tool is not always consistent with short text. So, if 
you're after language identification of tweets, I would combine this tool with 
looking for specific words (maybe stop words) to mark the language, as well as 
script types (or even character types).

- By the way, the tool now removes URLs and email addresses. It does not remove 
"tags", though. But again, it's meant to deal with pure text.
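
One way the stop-word idea above might look is sketched below. The word lists are tiny, hand-picked examples (an assumption, not real linguistic resources), and `ngram_detect` stands for any caller-supplied n-gram detector; neither helper is part of language-detection.

```python
import re

# Illustrative stop-word lists only (assumption); real lists would be much larger.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "a"},
    "af": {"die", "en", "is", "van", "het", "vir", "nie"},
    "fr": {"le", "la", "et", "les", "des", "est", "un"},
}


def stop_word_vote(text):
    """Return the language with the most stop-word hits, or None if none or tied."""
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOP_WORDS.items()
    }
    best = max(scores.values())
    winners = [lang for lang, score in scores.items() if score == best]
    if best == 0 or len(winners) > 1:
        return None
    return winners[0]


def identify(text, ngram_detect):
    """Prefer the stop-word vote; otherwise fall back to the n-gram detector."""
    return stop_word_vote(text) or ngram_detect(text)
```

For "viking river cruise" the vote returns `None`, since no token appears in any list, so the result still depends entirely on the fallback detector.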

Original comment by mawa...@live.com on 18 Feb 2011 at 8:06

GoogleCodeExporter commented 9 years ago


Since langdetect uses spelling features for detection, it can't be sure of the 
language of text that is too short.
Perhaps, as the fourth comment says, short text detection needs to use word 
features; I think so too. However, none of "viking", "river" and "cruise" are 
stop words... ;(

An acquaintance of mine who uses it for tweets is trying to apply langdetect to 
the concatenation of several tweets (few users tweet in more than one language). 
Could you try it?
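
A rough sketch of the concatenation approach, assuming tweets arrive as `(user, text)` pairs; the helper name and the input shape are hypothetical.

```python
from collections import defaultdict


def concatenate_by_user(tweets):
    """Group (user, text) pairs and join each user's texts into one string."""
    grouped = defaultdict(list)
    for user, text in tweets:
        grouped[user].append(text)
    # One longer string per user gives the detector more characters to work with.
    return {user: " ".join(texts) for user, texts in grouped.items()}


sample = [
    ("alice", "viking river cruise"),
    ("bob", "bonjour tout le monde"),
    ("alice", "what a pleasant holiday"),
]
for user, text in concatenate_by_user(sample).items():
    print(user, "->", text)
```

The language detected for the combined text would then be assigned to each of that user's tweets, which relies on the assumption that a given user mostly writes in one language.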

Original comment by nakatani.shuyo on 21 Feb 2011 at 3:14

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

I can work around the situation by simply adding dummy English words to 
"viking river cruise", such as "viking river cruise pleasant holiday". Then the 
language probability tends toward English. But I think this is not best practice.

Original comment by srssreej...@gmail.com on 21 Feb 2011 at 5:34

huy510cnt / language-detection

English detected as af #8