rfdiaz / language-detection

Automatically exported from code.google.com/p/language-detection

Short sentences for Polish are not detected properly #32

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
TEXT="Mam to kino 2,5 roku.Nic dodać.Jest po prostu super."

getProbabilities() for the above text results in:
[hr : 0.9999948455378745]

I am aware of the "short sentence issue" 
http://code.google.com/p/language-detection/issues/detail?id=12&q=short

But for me this is a bug. Why? Because the text is over 50 characters long and 
contains some pretty obvious language features. Take, for instance, the letter "ć": 
it exists only in Polish (pl) and Croatian (hr). If we look at the frequencies 
in the profiles, we see that:
profiles/pl - "ć":60605
profiles/hr - "ć":16773

I could understand getProbabilities() returning "hr", but why is there no 
Polish at all?! Is there any way to train the language detector on my own data?

Original issue reported on code.google.com by mic...@senti1.com on 16 Jan 2012 at 8:12

GoogleCodeExporter commented 9 years ago
The Polish profile has a larger total frequency than the Croatian one, so the ratio for "ć" is smaller in Polish.
Your example also contains a lot of noise (mam, to, roku.Nic, super), so the detector may get it wrong.
But I know this is an excuse :D
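
As a rough illustration of the ratio argument: the counts for "ć" below are the ones quoted in the report, but the profile totals are hypothetical placeholders (the real totals are not given in this thread). They are chosen only to show how a larger raw count can still give a smaller ratio.

```python
# Raw counts of "ć" as quoted from the profiles in the original report.
count_c_acute = {"pl": 60605, "hr": 16773}

# HYPOTHETICAL totals of all single-character occurrences per profile.
# Made-up numbers for illustration, not values read from the real profiles.
hypothetical_total = {"pl": 30_000_000, "hr": 5_000_000}


def relative_frequency(lang):
    """Share of the profile's characters that are "ć"."""
    return count_c_acute[lang] / hypothetical_total[lang]


ratios = {lang: relative_frequency(lang) for lang in count_c_acute}
for lang, ratio in ratios.items():
    print(f"{lang}: {ratio:.6f}")
```

With these assumed totals, "pl" has the larger raw count but "hr" has the larger ratio, which is the effect described above.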

One way is to use profiles that suit your data.

I created 17 language profiles from a Twitter corpus and committed them here.
http://code.google.com/p/language-detection/source/browse/#git%2Fprofiles.sm
(I will bundle them into language-detection sooner or later.)
I tried language detection on your example with these profiles, and it 
output the correct result, 'pl'.

Original comment by nakatani.shuyo on 17 Jan 2012 at 7:47

GoogleCodeExporter commented 9 years ago
I made a mistake in the previous comment.
The profiles I mentioned do not include Croatian, so they cannot output 'hr'. Sorry.

If you want to distinguish the two languages more precisely, you can generate profiles 
from an appropriate corpus.
langdetect has a tool that generates profiles from a Wikipedia database or from arbitrary 
text. See here:

http://code.google.com/p/language-detection/wiki/Tools

Original comment by nakatani.shuyo on 17 Jan 2012 at 8:22
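
For readers wondering what such a profile contains: the entries quoted in the report ("ć":60605) are counts of character sequences. The sketch below counts character n-grams in a piece of text. It is only an illustration under that assumption, not the code of the langdetect tool; the function name `ngram_counts` and the choice of n-gram lengths 1 to 3 are made up for this example, and the real tool may normalize text differently.

```python
from collections import Counter


def ngram_counts(text, max_n=3):
    """Count character n-grams of length 1..max_n in text."""
    counts = Counter()
    for n in range(1, max_n + 1):
        # Slide a window of width n over the text.
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


sample = "nic dodać"
profile = ngram_counts(sample)
print(profile["ć"], profile["d"], profile["do"])
```

Running the same counting over a large corpus that resembles your own data (tweets rather than Wikipedia articles, say) is the general idea behind generating a better-suited profile.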

GoogleCodeExporter commented 9 years ago
Thanks for the quick reply :)

I'll give the Twitter profiles a try.

Original comment by mic...@senti1.com on 17 Jan 2012 at 9:24

GoogleCodeExporter commented 9 years ago
Right now I need language detection to exclude non-Polish texts, so the lack of 
Croatian in the Twitter profiles is not a problem :)

Returning to my example: why didn't getProbabilities() return Polish at all? If 
it has a bigger total frequency, shouldn't it return e.g. [hr: 99%, pl: 40%]? Is 
there a minimum probability threshold?

Original comment by mic...@senti1.com on 17 Jan 2012 at 9:31

GoogleCodeExporter commented 9 years ago
If you have only one profile (Polish), you get ~100% probability for most texts, so 
more profiles == better results (look at the implementation of updateLangProb). 
The language-detection library is not that good at excluding or recognizing the 
correct language for short texts. For better Twitter detection you should do more 
than detect the language of a single tweet. Language detection works well for longer texts.
Different language profiles (not based on Wikipedia) might also give better 
results.

Original comment by markowsk...@gmail.com on 18 Jan 2012 at 12:48
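
The single-profile point can be seen with a small sketch. It assumes the detector scales per-language scores so that they sum to 1; the scores themselves are hypothetical, and this is not the library's updateLangProb code.

```python
def normalize(scores):
    """Scale non-negative scores so that they sum to 1."""
    total = sum(scores.values())
    return {lang: score / total for lang, score in scores.items()}


# HYPOTHETICAL unnormalized scores for one short text.
only_polish = normalize({"pl": 0.002})
two_profiles = normalize({"pl": 0.002, "hr": 0.006})

print(only_polish)
print(two_profiles)
```

With a single profile the only candidate always ends up at 1.0, whatever the text is, so the result says nothing about whether the text is really Polish. Only competing profiles can pull the probability down.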

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
> 4
getProbabilities() returns probabilities, so they add up to nearly 
100%. A result like [hr: 99%, pl: 40%] cannot occur.

Original comment by nakatani.shuyo on 25 Jan 2012 at 3:07
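
A small sketch of what this means for the reported result. The "hr" value is the one from the original report. The cutoff is an assumption made for this example (the thread does not confirm that one exists or what its value is), and giving the whole remainder to "pl" is a deliberate upper bound.

```python
# Value reported by getProbabilities() in the original report.
reported = {"hr": 0.9999948455378745}

# If all probabilities sum to 1, this is everything left for the other languages.
remainder = 1.0 - reported["hr"]

# ASSUMED cutoff below which a language is left out of the returned list.
ASSUMED_THRESHOLD = 0.1


def visible(probabilities, threshold=ASSUMED_THRESHOLD):
    """Keep only the languages whose probability exceeds the threshold."""
    return {lang: p for lang, p in probabilities.items() if p > threshold}


# Upper bound: pretend the entire remainder belongs to Polish.
candidates = {"hr": reported["hr"], "pl": remainder}
shown = visible(candidates)
print(remainder, shown)
```

Even in this best case Polish gets well under 0.001%, so under the assumed cutoff it would simply not appear in the list.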

GoogleCodeExporter commented 9 years ago
>5

Thanks.
For Twitter detection, it is effective to combine multiple tweets into a single text.

Original comment by nakatani.shuyo on 25 Jan 2012 at 3:30
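
A minimal sketch of that suggestion, assuming the tweets come from the same author and are written in the same language. The helper name `combine_tweets` is made up for this example, and the three "tweets" are just the sentences of the original sample split apart.

```python
def combine_tweets(tweets, separator=" "):
    """Join several short tweets into one longer text for detection."""
    cleaned = (tweet.strip() for tweet in tweets)
    return separator.join(tweet for tweet in cleaned if tweet)


# Hypothetical tweets: the sentences of the sample text from the report.
tweets = ["Mam to kino 2,5 roku.", "Nic dodać.", "Jest po prostu super."]
combined = combine_tweets(tweets)
print(combined)
```

The combined string is then passed to the detector as a single text, which gives it more n-grams to work with than any one tweet.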