huy510cnt / language-detection

Automatically exported from code.google.com/p/language-detection

prior map hint #13

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago
How does one set the priorMap to give the library a hint towards a language?
I have user input that indicates what the language may be.

If I set a probability for a language, detection always goes in that direction.

e.g. 

    Detector detector = DetectorFactory.create();
    HashMap<String, Double> priorMap = new HashMap<String, Double>();
    priorMap.put("ja", new Double(0.001));
    detector.setPriorMap(priorMap);
    detector.append("This is an english sentence.");

I would expect the language to be detected as "en"; instead I get "ja"
with a probability of 0.99999999998567. This seems to be the case for all
languages: if you seed the priorMap with a language, the library just confirms
that language as the right one.

Am I not using the interface correctly?

Original issue reported on code.google.com by ed_b...@yahoo.com on 16 Apr 2011 at 6:32

GoogleCodeExporter commented 9 years ago
setPriorMap is intended not only to give a weight to each language but also to
restrict detection to the languages listed.
So, as you mentioned, it requires setting every language you want to detect.
Would it help somewhat if a default prior map could be retrieved?
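
Given the behavior described above, one way to hint at a language without excluding the others is to give every candidate language a nonzero weight and raise only the hinted one. The following is a minimal sketch under stated assumptions, not library code: `PriorHint`, `buildPriorMap`, the hard-coded language list, and the boost factor are hypothetical. In real use the list of codes would have to match the loaded language profiles, and the resulting map would be passed to `detector.setPriorMap(...)`.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;

class PriorHint {
    /**
     * Builds an unnormalized prior map: every language in langs gets weight 1.0,
     * except the hinted language, which gets weight boost.
     * (Hypothetical helper for illustration; not part of the library.)
     */
    static HashMap<String, Double> buildPriorMap(List<String> langs, String hint, double boost) {
        if (boost <= 0.0) {
            throw new IllegalArgumentException("boost must be positive");
        }
        HashMap<String, Double> prior = new HashMap<String, Double>();
        for (String lang : langs) {
            prior.put(lang, lang.equals(hint) ? boost : 1.0);
        }
        return prior;
    }

    public static void main(String[] args) {
        // Assumed language codes; replace with the codes of the profiles you load.
        List<String> langs = Arrays.asList("en", "ja", "zh-tw", "fr");
        HashMap<String, Double> prior = buildPriorMap(langs, "ja", 5.0);
        System.out.println(prior);
        // With the library on the classpath, the map would then be handed over:
        // detector.setPriorMap(prior);
    }
}
```

Because no language is left at zero, the hint only shifts the result when the text itself is ambiguous; how large the boost should be is a tuning choice.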

Original comment by nakatani.shuyo on 18 Apr 2011 at 3:23

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
> priorMap.put("en", new Double(0.0));

A language whose prior is set to 0 always has probability 0.
If you want to put more weight on English, give it a larger prior probability.

> priorMap.put("ja", new Double(0.01));
> priorMap.put("en", new Double(0.1));

Although the above prior is not normalized (i.e. its sum is not 1.0), the
setPriorMap method normalizes the prior automatically.
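
To make the two points above concrete, here is a small self-contained sketch of prior normalization and of why a language with a zero (or missing) prior can never win. It does not use the library: `PriorSketch`, `normalize`, `posterior`, and the likelihood values are made up for illustration, and the actual internals of setPriorMap may differ.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PriorSketch {
    /** Scales the weights so that they sum to 1.0. */
    static Map<String, Double> normalize(Map<String, Double> weights) {
        double sum = 0.0;
        for (double w : weights.values()) {
            sum += w;
        }
        if (sum <= 0.0) {
            throw new IllegalArgumentException("weights must have a positive sum");
        }
        Map<String, Double> out = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : weights.entrySet()) {
            out.put(e.getKey(), e.getValue() / sum);
        }
        return out;
    }

    /**
     * Bayes' rule: posterior is proportional to prior times likelihood.
     * A language missing from the prior map is treated as having prior 0.
     */
    static Map<String, Double> posterior(Map<String, Double> prior, Map<String, Double> likelihood) {
        Map<String, Double> unnormalized = new LinkedHashMap<>();
        for (Map.Entry<String, Double> e : likelihood.entrySet()) {
            Double p = prior.get(e.getKey());
            unnormalized.put(e.getKey(), (p == null ? 0.0 : p) * e.getValue());
        }
        return normalize(unnormalized);
    }

    public static void main(String[] args) {
        // The unnormalized prior from the comment above.
        Map<String, Double> prior = new LinkedHashMap<>();
        prior.put("ja", 0.01);
        prior.put("en", 0.1);
        System.out.println("normalized prior: " + normalize(prior));

        // Hypothetical likelihoods for an English sentence.
        Map<String, Double> likelihood = new LinkedHashMap<>();
        likelihood.put("en", 0.9);
        likelihood.put("ja", 1e-9);

        // A prior that lists only "ja" leaves "en" at 0, so "ja" takes all the mass.
        Map<String, Double> onlyJa = new LinkedHashMap<>();
        onlyJa.put("ja", 0.001);
        System.out.println("posterior, ja-only prior: " + posterior(onlyJa, likelihood));

        // With both languages listed, the text evidence can still decide.
        System.out.println("posterior, ja+en prior: " + posterior(normalize(prior), likelihood));
    }
}
```

In this sketch the ja-only prior reproduces the behavior reported at the top of the thread: however unlikely the text is under "ja", it is the only language with nonzero prior, so it receives the entire posterior.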

Original comment by nakatani.shuyo on 19 Apr 2011 at 5:31

GoogleCodeExporter commented 9 years ago
[deleted comment]
GoogleCodeExporter commented 9 years ago
This library cannot detect the language of proper nouns such as personal names,
place names and so on. (For example, what language is "iPhone"?)
Your example is detected as zh-tw because, in this case, the library uses only
the frequency of each kanji.

Original comment by nakatani.shuyo on 20 Apr 2011 at 6:43

GoogleCodeExporter commented 9 years ago
With comments being deleted, I think it would be a good idea (when anyone has
the time) to add a full example to the wiki showing how to use the setPriorMap
method. This would benefit everyone else as well. Thanks.

Original comment by mawa...@live.com on 20 Apr 2011 at 7:04

GoogleCodeExporter commented 9 years ago
I see.
I'll do it. Thanks.

Original comment by nakatani.shuyo on 21 Apr 2011 at 3:12