Open GoogleCodeExporter opened 9 years ago
I am no expert in Afrikaans, but I extracted this text from the Afrikaans
Wikipedia: "Dit het vir die eerste keer..". My guess is that because you
entered a short text, the tool didn't have enough characters to make an
intelligent judgment. The characters in your text are similar to the
Afrikaans n-gram profile, hence the "af" outcome.
You would get a better result if you added a few more words (in my view,
around 200 characters would give better accuracy).
Original comment by mawa...@live.com
on 17 Feb 2011 at 9:32
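To illustrate the point above about short input and n-gram profiles, here is a minimal sketch that counts character n-grams in plain Python. It is not langdetect's actual implementation: the function name `char_ngrams`, the lowercasing, and the space padding are assumptions made for illustration only.

```python
from collections import Counter

def char_ngrams(text, max_n=3):
    """Count character n-grams of length 1..max_n (illustrative only)."""
    # Pad with spaces so word boundaries show up in the n-grams.
    padded = " " + text.lower().strip() + " "
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts

short = char_ngrams("Viking river cruise")
print(sum(short.values()), "n-grams in total,", len(short), "distinct")
```

A three-word text produces only a few dozen n-grams, so a handful of letter sequences that happen to be common in another language can dominate the comparison.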
Thanks for the comment and the response.
As the second comment says, langdetect is not good at detecting the language
of very short texts.
Could you try a longer text?
Original comment by nakatani.shuyo
on 18 Feb 2011 at 3:14
Thank you for the quick response.
It works well for long text, but I am trying to detect the language of tweets.
Issues I faced while using language-detection with tweets:
1) If a tweet contains URL links, then even if it is English it is detected
as some other language.
So I removed the URL links from the tweets and tried again. That is why the
text became too short.
I split my sample text "Viking river cruise" into words and checked each one
with lang-detect.
The results are as follows:
viking - af
river - en
cruise - fr
So I assume that lang-detect works with short text too... but the result may
differ once the probabilities are taken into account.
Original comment by srssreej...@gmail.com
on 18 Feb 2011 at 5:10
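The URL-stripping step described above could be sketched as follows. This is a rough illustration using only Python's `re` module, not part of language-detection itself; the name `clean_tweet`, the regular expressions, and the decision to also drop @mentions and the `#` marker are assumptions.

```python
import re

# Assumed patterns: http(s) links, bare www links, and @mentions
# that are not part of an email address.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"(?<!\w)@\w+")

def clean_tweet(text):
    """Remove URLs and @mentions, keep hashtag words, collapse whitespace."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the word, drop the marker
    return " ".join(text.split())

print(clean_tweet("Viking river cruise http://example.com/x @friend #travel"))
# prints: Viking river cruise travel
```

Cleaning like this removes noise but, as noted above, it also leaves less text for the detector to work with.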
My thoughts are:
-viking on its own could well be "af"
river "en"
cruise "fr"
However, combining them might not always give you English. Moreover, other
short English, Spanish (etc.) words might give you a different language.
-In my view, so far, the tool is not always consistent with short text. So, if
you're after identifying the language of tweets, I would combine this tool
with a search for specific words (maybe stop words) that mark the language, as
well as script types (or even character types).
-By the way, the tool now removes URLs and email addresses. It does not remove
"tags", though. But again, it is meant to deal with pure text.
Original comment by mawa...@live.com
on 18 Feb 2011 at 8:06
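The stop-word idea suggested above could look roughly like this. It is a minimal sketch, not a feature of langdetect: the tiny word lists are hand-picked assumptions, and `stop_word_hint` is a hypothetical helper that only returns a language when one list clearly wins.

```python
# Tiny hand-picked lists for illustration; real lists would be much longer.
STOP_WORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "with"},
    "af": {"die", "het", "vir", "en", "nie", "is", "van"},
    "fr": {"le", "la", "les", "et", "est", "des", "pour"},
}

def stop_word_hint(text):
    """Return the language whose stop words occur most often, or None."""
    tokens = text.lower().split()
    scores = {
        lang: sum(tok in words for tok in tokens)
        for lang, words in STOP_WORDS.items()
    }
    top = max(scores.values())
    # No hits, or a tie between languages: give no hint at all.
    if top == 0 or list(scores.values()).count(top) > 1:
        return None
    return max(scores, key=scores.get)

print(stop_word_hint("Dit het vir die eerste keer"))  # prints: af
print(stop_word_hint("viking river cruise"))          # prints: None
```

The second call shows the limit of this approach: a text with no stop words gives no hint, so it can only supplement the n-gram result, not replace it.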
Since langdetect uses spelling features for detection, it cannot be sure of
the language of very short text.
Perhaps, as the fourth comment says, short-text detection needs to use word
features; I think so too. However, none of "viking", "river" and "cruise" is a
stop word... ;(
An acquaintance of mine who uses it for tweets is trying to apply langdetect
to the concatenation of several tweets (users who tweet in more than one
language are few).
Could you try it?
Original comment by nakatani.shuyo
on 21 Feb 2011 at 3:14
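The concatenation approach mentioned above could be sketched like this. The detector is passed in as a plain callable, so nothing here depends on langdetect's real API; `detect_per_user` is a hypothetical helper, and the 200-character threshold is an assumption borrowed from the suggestion earlier in this thread.

```python
from collections import defaultdict

def detect_per_user(tweets, detect, min_chars=200):
    """Join each user's tweets and detect the language once per user.

    tweets: iterable of (user, text) pairs.
    detect: any callable mapping a string to a language code.
    Users with less than min_chars of joined text get None.
    """
    by_user = defaultdict(list)
    for user, text in tweets:
        by_user[user].append(text)

    results = {}
    for user, texts in by_user.items():
        joined = " ".join(texts)
        results[user] = detect(joined) if len(joined) >= min_chars else None
    return results
```

This relies on the assumption stated above that a single user rarely tweets in several languages; for users who do, the joined text would mix languages and the result would be less reliable.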
[deleted comment]
I can work around the situation by simply adding dummy English words to
"viking river cruise", such as "viking river cruise pleasant holiday". Then
the language probability tends toward English. But I think this is not best
practice.
Original comment by srssreej...@gmail.com
on 21 Feb 2011 at 5:34
Original issue reported on code.google.com by
srssreej...@gmail.com
on 17 Feb 2011 at 4:26