optimaize / language-detector

Language Detection Library for Java
Apache License 2.0

Cannot find where you divide ngram count from language file by n_words #75

Closed · BlazingJ closed this 7 years ago

BlazingJ commented 7 years ago

Hello,

I was reading your sources because I am working on language detection for a small project of mine. I came to your project after the shuyo/language-detection one. I cannot find in your sources where you divide the integer values associated with each ngram in the language files by the total number of analyzed words for that ngram size.

In the shuyo/language-detection project this is done in DetectorFactory.java at line 135. Without this division, all results will be biased toward the languages with the most words.
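
For illustration, here is a small self-contained sketch of the bias I mean, using made-up counts that are not taken from either library: comparing raw n-gram counts favors whichever profile was built from the larger corpus, while dividing by each profile's total does not.

```java
// Illustrative sketch with hypothetical numbers (not code or data from either library).
public class NgramBiasExample {
    public static void main(String[] args) {
        // Hypothetical counts for one trigram, plus total trigram counts per profile.
        long countEn = 50_000, totalEn = 10_000_000;   // large English corpus
        long countNl = 4_000,  totalNl = 500_000;      // much smaller Dutch corpus

        // Raw counts make English look far more likely, purely because its corpus is bigger.
        System.out.println("raw:        en=" + countEn + " nl=" + countNl);

        // Dividing by the per-profile total removes the corpus-size bias.
        double freqEn = (double) countEn / totalEn;    // 0.005
        double freqNl = (double) countNl / totalNl;    // 0.008
        System.out.println("normalized: en=" + freqEn + " nl=" + freqNl);
    }
}
```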

Did I miss something in your sources?

djelinski commented 7 years ago

The stats are recalculated every time a language profile is built. The division happens here: https://github.com/optimaize/language-detector/blob/0dcc68e4fb4e840490195e3473ce0243f678d656/src/main/java/com/optimaize/langdetect/NgramFrequencyData.java#L84
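
In rough terms the idea is the same as in this hypothetical sketch (each count divided by the total for that n-gram length); it is only an illustration, not the actual NgramFrequencyData code:

```java
import java.util.HashMap;
import java.util.Map;

// Minimal sketch of the normalization step, assuming raw n-gram counts are available as a map.
public class FrequencyNormalization {
    /** Converts raw counts into relative frequencies per n-gram length. */
    static Map<String, Double> normalize(Map<String, Long> counts) {
        // Sum the counts per n-gram length (1-grams, 2-grams, ...).
        Map<Integer, Long> totals = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            totals.merge(e.getKey().length(), e.getValue(), Long::sum);
        }
        // Divide each count by the total for its n-gram length.
        Map<String, Double> frequencies = new HashMap<>();
        for (Map.Entry<String, Long> e : counts.entrySet()) {
            long total = totals.get(e.getKey().length());
            frequencies.put(e.getKey(), e.getValue() / (double) total);
        }
        return frequencies;
    }

    public static void main(String[] args) {
        Map<String, Long> counts = new HashMap<>();
        counts.put("a", 300L);
        counts.put("b", 100L);
        counts.put("ab", 50L);
        counts.put("ba", 150L);
        // Prints a=0.75, b=0.25, ab=0.25, ba=0.75 (map iteration order may vary).
        System.out.println(normalize(counts));
    }
}
```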

BlazingJ commented 7 years ago

Thank you for pointing it out to me.