StarPos and EndPos for ngram log probability

GoogleCodeExporter commented 9 years ago

Hi

I am a bit confused on how to find the log probabilities of ngrams. From 
PerplexityTest.java the code looks like below
for (i = 1; i <= sent.length - lm_.getLmOrder(); ++i) {
                    final float score = lm_.getLogProb(sent, i, i + lm_.getLmOrder());
                    sentScore += score;
                }
The thing I am not getting is why is it starting from 1 and why i + 
lm_.getLmOrder() and why sent is only number of words in line + 2. 

Ideally I was expecting sent to be number of words line + 3. So if I have a 
sentence, Hello how are you, sent should be START START Hello how are you STOP. 
So the first trigram should be START START Hello. So if I wanted to find the 
log probability of the first trigram I would use startpos 0 and endpos 2. The 
last trigram will be "are you STOP" , startpos 4, and endpos 6.

Obviously I am making some assumptions here. I tried to dig the code to prove 
myself otherwise but unfortunately could not get much intelligence in this 
context.

I will be grateful for any help on this.

Regards
Deb

Original issue reported on code.google.com by db12...@my.bristol.ac.uk on 20 Mar 2014 at 9:03

GoogleCodeExporter commented 9 years ago

This is be design. We actually generate the first word given START, so the 
first generation is a bigram. This is the way SRILM models things, and 
generally accepted, as far as I know. 

If the model were a 5-gram, then the first word would be generated as a bigram, 
the second as a trigram, and so on.

Original comment by adpa...@gmail.com on 20 Mar 2014 at 6:30

Changed state: WontFix

GoogleCodeExporter commented 9 years ago

Ok - thanks a lot for that info. Few questions on the design

1. Can you let me know why the loop starts from i=1, rather than i=0. 

2. Also why is it i + lm_.getLmOrder() ? Is the endPos non inclusive ? 
E.G if i is 1 and getLmOrder() is 3, startpos and endpos will be 1 and 4. If I 
am looking for tri-grams, I would have thought startpos and endpos should be 1 
and 3.

3. I am loading google books binary in my code and need only trigram log 
probabilities. Going by your explanation of using a bigram for the start of the 
sentence, should I pass 0,1 as the startpos and endpos of the first gram ? And 
then again 0,2 as the startpost and endpos of the next gram? and then follow it 
up by 1,3 ?

WOuld be grateful again for your help on this

Original comment by db12...@my.bristol.ac.uk on 21 Mar 2014 at 5:36

GoogleCodeExporter commented 9 years ago

1. The sentence already has start markers, and we don't generate start.

2. endPos is non-inclusive, you're right.

3. I'm confused why you are writing your own code? AWhat exactly is the score 
you want to compute? Why can't you just use ComputeLogProbabilityOfTextStream?

Original comment by adpa...@gmail.com on 7 Sep 2014 at 7:05

malcolmgreaves / berkeleylm

StarPos and EndPos for ngram log probability #19