detectBlock and random updateLangProb

malcolmgreaves / language-detection

Automatically exported from code.google.com/p/language-detection. Some after-the-fact modifications to get this working within sbt.

Apache License 2.0


detectBlock and random updateLangProb #10

Open GoogleCodeExporter opened 9 years ago

GoogleCodeExporter commented 9 years ago

In Detector.java, detectBlock() draws a random int from 0 to ngrams.size() on 
each iteration. Because the draws are independent, the same n-gram can be 
passed to updateLangProb multiple times. 
For short texts this can cause a large deviation in the resulting probabilities. 

For short and medium-length texts, I propose drawing only n-gram indices that 
have not yet been selected. For short texts, a full pass over all n-grams may 
be even better.
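
A minimal sketch of this proposal, for illustration only: the class `NGramSampling` and its methods are hypothetical names, not part of the library, and this is not the code from the attachment posted later in the thread. Shuffling a copy of the n-gram list gives a random order in which each n-gram occurrence is visited exactly once (the "full pass" case), and taking a prefix of that order gives draws without repetition.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical helper, not part of the language-detection library.
class NGramSampling {

    /** Returns a random visiting order in which every n-gram occurrence appears exactly once. */
    static List<String> shuffledPass(List<String> ngrams, Random rand) {
        List<String> order = new ArrayList<>(ngrams); // copy so the caller's list is untouched
        Collections.shuffle(order, rand);             // uniform random permutation
        return order;
    }

    /** Draws up to k n-grams so that no list position is selected twice. */
    static List<String> drawWithoutReplacement(List<String> ngrams, int k, Random rand) {
        List<String> order = shuffledPass(ngrams, rand);
        int count = Math.min(k, order.size());        // cannot draw more than exist
        return new ArrayList<>(order.subList(0, count));
    }
}
```

Under these assumptions, a loop in the style of detectBlock() could iterate over `shuffledPass(ngrams, rand)` and call updateLangProb once per element, instead of picking a fresh random index each time. Whether this actually improves accuracy is an open question that the sketch does not answer.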

Original issue reported on code.google.com by markowsk...@gmail.com on 28 Feb 2011 at 8:24

GoogleCodeExporter commented 9 years ago

I agree. The random draws in this method could be distributed better. I 
never had time to test the few proposals I have in mind, but it would be great 
to have this changed for more consistent and accurate results.

Original comment by mawa...@live.com on 1 Mar 2011 at 7:46

GoogleCodeExporter commented 9 years ago

[deleted comment]

GoogleCodeExporter commented 9 years ago

Hmm... I ran some initial tests and the results don't look very promising. I 
admit I need to debug more to understand what the best solution would be 
(it is not far from your sample, though). It would be good if someone else who 
can spend a few hours on this shared their thoughts too. I will try to 
look at it in the next couple of weeks, along with short-text language detection 
(from the previous issue).

Original comment by mawa...@live.com on 1 Mar 2011 at 6:34

GoogleCodeExporter commented 9 years ago

Here is new sample code for detectBlock, with some improvements, that minimizes 
the deviation of the resulting probabilities (the previous version was deleted). 
In some cases it works worse, in others better.

I also propose modifying extractNGrams() to add a space at the 
beginning and at the end of the text. This could improve results for some short texts.

Of course, this is not a solution for short-text detection.
In my opinion, that problem cannot be solved without a dictionary.
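
The padding idea can be sketched as follows. This is an illustrative, standalone extractor, not the library's extractNGrams(): the class `PaddedNGrams`, the `maxN` parameter, and the rule of skipping whitespace-only n-grams are all assumptions made for the example. Padding turns the first and last characters into word-boundary n-grams, which gives a short text a few extra features.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical standalone extractor, not the library's extractNGrams().
class PaddedNGrams {

    /** Extracts character n-grams of length 1..maxN from the text padded with one space on each side. */
    static List<String> extract(String text, int maxN) {
        String padded = " " + text.trim() + " ";     // mark the start and end of the text
        List<String> out = new ArrayList<>();
        for (int n = 1; n <= maxN; n++) {
            for (int i = 0; i + n <= padded.length(); i++) {
                String gram = padded.substring(i, i + n);
                if (!gram.trim().isEmpty()) {        // skip n-grams made only of whitespace
                    out.add(gram);
                }
            }
        }
        return out;
    }
}
```

For the input "ab" with maxN = 2, this yields "a", "b", " a", "ab", "b ": the boundary bigrams " a" and "b " exist only because of the padding.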

Original comment by markowsk...@gmail.com on 1 Mar 2011 at 7:58

Attachments:

sample.java

GoogleCodeExporter commented 9 years ago

Thanks for the comments and the experiment. I would have run the experiment, 
but it is already done :D
As you say, your proposal is better in some cases and worse in others; I think so too.

> I also propose some modification to extractNGrams() to add space at the 
> beginning and at the end of text. It could improve in some cases of short text.

I'll consider this proposal.

Original comment by nakatani.shuyo on 2 Mar 2011 at 2:26