This is by design. We actually generate the first word given START, so the
first generation is a bigram. This is the way SRILM models things, and it is
generally accepted, as far as I know.
If the model were a 5-gram, then the first word would be generated as a bigram,
the second as a trigram, and so on.
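A minimal sketch of that growing context (not BerkeleyLM source): with an LM of order n, the word at position i (START sits at index 0 and is never generated) is scored with an n-gram of order min(i + 1, n), so a 5-gram model scores the first word as a bigram, the second as a trigram, and so on until the full order is reached.

```java
public class NgramOrderAtStart {
    // Order of the n-gram used to generate the word at position i,
    // where index 0 is the START symbol (never generated itself).
    static int orderUsed(int i, int lmOrder) {
        return Math.min(i + 1, lmOrder);
    }

    public static void main(String[] args) {
        int lmOrder = 5;
        StringBuilder sb = new StringBuilder();
        for (int i = 1; i <= 6; i++) {
            if (i > 1) sb.append(' ');
            sb.append(orderUsed(i, lmOrder));
        }
        // Orders for the first six words of a sentence under a 5-gram model
        System.out.println(sb); // 2 3 4 5 5 5
    }
}
```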
Original comment by adpa...@gmail.com
on 20 Mar 2014 at 6:30
OK, thanks a lot for that info. A few questions on the design:
1. Can you let me know why the loop starts from i=1 rather than i=0?
2. Also, why is it i + lm_.getLmOrder()? Is endPos non-inclusive?
E.g. if i is 1 and getLmOrder() is 3, startPos and endPos will be 1 and 4. If I
am looking for trigrams, I would have thought startPos and endPos should be 1
and 3.
3. I am loading the Google Books binary in my code and need only trigram log
probabilities. Going by your explanation of using a bigram for the start of the
sentence, should I pass 0,1 as the startPos and endPos of the first n-gram? And
then 0,2 as the startPos and endPos of the next n-gram, and then follow that
up with 1,3?
Would be grateful again for your help on this.
Original comment by db12...@my.bristol.ac.uk
on 21 Mar 2014 at 5:36
1. The sentence already has start markers, and we don't generate START.
2. endPos is non-inclusive, you're right.
3. I'm confused about why you are writing your own code. What exactly is the
score you want to compute? Why can't you just use
ComputeLogProbabilityOfTextStream?
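To make the answers to questions 1 and 2 concrete, here is a sketch (my own illustration, not the library's actual loop) of the (startPos, endPos) windows this implies, assuming endPos is exclusive and each call scores the last word in the window. For a trigram model over [START, w1, w2, w3], the windows come out as (0,2) for w1 (a bigram against START), (0,3) for w2, and (1,4) for w3 — not the (0,1), (0,2), (1,3) guessed above.

```java
import java.util.ArrayList;
import java.util.List;

public class ScoringWindows {
    // Returns the [startPos, endPos) pairs used to score each word of a
    // sentence of `len` tokens (START included at index 0) under an LM of
    // the given order. i starts at 1 because START is never generated.
    static List<int[]> windows(int len, int order) {
        List<int[]> out = new ArrayList<>();
        for (int i = 1; i < len; i++) {
            int startPos = Math.max(0, i - order + 1); // at most `order` words
            int endPos = i + 1;                        // exclusive end at word i
            out.add(new int[] { startPos, endPos });
        }
        return out;
    }

    public static void main(String[] args) {
        // Trigram model over [START, w1, w2, w3]:
        for (int[] w : windows(4, 3)) {
            System.out.println("(" + w[0] + ", " + w[1] + ")");
        }
        // (0, 2)  bigram:  START w1
        // (0, 3)  trigram: START w1 w2
        // (1, 4)  trigram: w1 w2 w3
    }
}
```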
Original comment by adpa...@gmail.com
on 7 Sep 2014 at 7:05
Original issue reported on code.google.com by
db12...@my.bristol.ac.uk
on 20 Mar 2014 at 9:03