nuance1979 / srilm-python

Python binding for SRI Language Modeling Toolkit implemented in Cython
MIT License

Fix bug with prob_ngram() #3

Closed menphix-watanabe closed 2 years ago

menphix-watanabe commented 7 years ago

In srilm/base.pyx:prob_ngram(): context = ngram[:-1].reverse() This line assigns None to context, because .reverse() reverses the list in place and doesn't return anything. So prob_ngram() always uses the empty context.
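To illustrate the pitfall outside the library, here is a minimal sketch using plain Python lists. The index values are made up and only stand in for what vocab.index() would return; the two alternatives shown are generic ways to get a reversed copy, not necessarily the exact change made in this pull request.

```python
# Hypothetical word indices standing in for vocab.index(['it', 'was', 'the']).
ngram = [11, 42, 7]

# Buggy pattern: list.reverse() works in place on the temporary slice
# and returns None, so the context is lost.
buggy_context = ngram[:-1].reverse()

# Two ways to build a reversed copy of the context instead.
sliced_context = ngram[:-1][::-1]
reversed_context = list(reversed(ngram[:-1]))

print(buggy_context)     # None
print(sliced_context)    # [42, 11]
print(reversed_context)  # [42, 11]
```

Either alternative leaves the original ngram list untouched, which matters if the caller reuses it afterwards.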

Verified by:

#!/usr/bin/env python
import srilm
from srilm import ngram
vocab = srilm.vocab.Vocab()
lm = srilm.ngram.Lm(vocab, 4)
lm.read('tests/lm.txt')
lm.debug_level = 5
prob_ngram_result = lm.prob_ngram(vocab.index(['it', 'was', 'the']))

the_idx = vocab.index(['the'])
full_phrase = vocab.index(['it', 'was', 'the'])
context = full_phrase[:-1]
context.reverse()
prob_result = lm.prob(the_idx[0], context)
print("prob: {0}".format(prob_result))
print("prob_ngram: {0}".format(prob_ngram_result))

prob: -0.07153774797916412
prob_ngram: -0.07153774797916412

Also verified with the srilm ngram binary:

$ echo "it was the" > ./input.txt
$ ngram -lm ./tests/lm.txt -ppl ./input.txt -no-sos -no-eos -debug 2
reading 109 1-grams
./tests/lm.txt: line 10: warning: non-zero probability for <unk> in closed-vocabulary LM
reading 372 2-grams
reading 616 3-grams
it was the
        p( it |  )      = [1gram] 0.005376345 [ -2.269513 ]
        p( was | it ...)        = [2gram] 0.6549599 [ -0.1837853 ]
        p( the | was ...)       = [3gram] 0.8481297 [ -0.07153775 ]
0 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -2.524836 ppl= 6.944036 ppl1= 6.944036

file ./input.txt: 0 sentences, 3 words, 0 OOVs
0 zeroprobs, logprob= -2.524836 ppl= 6.944036 ppl1= 6.944036

p( the | was ...) is -0.07153775, which approximately equals -0.07153774797916412
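The numbers in the ngram output can also be cross-checked by hand. The sketch below only does arithmetic on values copied from the output above; it assumes the usual SRILM conventions that log probabilities are base 10 and that ppl is 10 ** (-logprob / (words - OOVs - zeroprobs + sentences)), which here means dividing by 3.

```python
import math

# Per-word log10 probabilities copied from the ngram -debug 2 output.
logprobs = [-2.269513, -0.1837853, -0.07153775]

# Value printed by both lm.prob() and lm.prob_ngram() in the Python check.
python_logprob = -0.07153774797916412

# The binary prints about 7 significant digits, so compare with a tolerance.
matches = math.isclose(logprobs[-1], python_logprob, abs_tol=1e-7)

# Assuming base-10 logs, 10 ** logprob recovers p( the | was ...).
prob_the = 10 ** logprobs[-1]

# The reported total logprob is the sum of the per-word values.
total_logprob = sum(logprobs)

# 3 words, 0 sentences, 0 OOVs, 0 zeroprobs, so the denominator is 3.
ppl = 10 ** (-total_logprob / 3)

print(matches)  # True
print("p(the | was ...) = {0:.7f}".format(prob_the))
print("logprob = {0:.6f}, ppl = {1:.6f}".format(total_logprob, ppl))
```

The recomputed values should agree with 0.8481297, -2.524836 and 6.944036 from the output up to rounding.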

@nuance1979 Please review. Thanks! :-)

menphix-watanabe commented 2 years ago

@nuance1979 Could you please take a look again? Thanks!