[Open] astariul opened this issue 1 year ago
I get results that differ from both yours and KenLM's, but I believe KenLM is making a mistake here.
What I got with KenLM on your corpus:

```
-0.6726411 went -0.022929981
-0.4034029 You go
```

What I got with my own Python script:

```
-0.614294447744974 went -0.022929960646239318
-0.39343950744536116 You go
```
I believe KenLM is miscalculating the discounts for 1-grams. KenLM has `D1=0.4, D2=1.6, D3+=1.4`, whereas I get `D1=0.5555555555555556, D2=1.1666666666666665, D3+=0.7777777777777777`.
And this is because KenLM miscounts the number of 1-grams with adjusted counts = 1 and 2.
Using the same method as in Issue #427, I find that KenLM prints `s.n[1] == 4` and `s.n[2] == 3`, i.e. KenLM thinks there are 4 1-grams with adjusted count = 1 and 3 1-grams with adjusted count = 2.
But actually, there are 5 1-grams with adjusted count = 1 (`I`, `You`, `Joe`, `Anna`, `ski`), which all occur after only 1 type of token; and 2 1-grams with adjusted count = 2 (`to` and `</s>`), which occur after 2 types of tokens.
There are two other causes for the discrepancy:
First, KenLM does not include `<s>` when calculating the vocabulary size, while your program does. I think KenLM's approach makes more sense: `<s>` is never predicted, so we don't need to assign any probability to it.
Second, when calculating the backoff, you're only summing the probability mass discounted from n-grams with adjusted count <= 3:

```python
b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count
```

This is a faithful implementation of the second equation in Sec 3.3 of the paper,

$$b(w_1^{n-1}) = \frac{\sum_{i=1}^{3} D_n(i)\,\bigl|\{x : a(w_1^{n-1} x) = i\}\bigr|}{\sum_x a(w_1^{n-1} x)}$$

but I think that equation is wrong: the upper limit of the summation should be infinity. Otherwise, the probability mass discounted from n-grams with adjusted count > 3 goes nowhere, and the unnormalized probabilities plus the backoff sum to less than 1. I think KenLM's implementation treats the upper limit as infinity -- see the sketch below.
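One way to realize the infinite upper limit is to clamp the discount index at 3 instead of dropping n-grams whose adjusted count exceeds 3. A minimal sketch with illustrative names (`discount` maps 1, 2, 3 to D1, D2, D3+):

```python
from collections import Counter

def backoff_weight(extension_counts, discount):
    """Backoff mass b(prefix), given the adjusted counts a(w_1^{n-1} x) of all
    n-grams extending the prefix and discount = {1: D1, 2: D2, 3: D3plus}."""
    # Adjusted counts of 3 or more all receive the D3+ discount, so clamping
    # the index at 3 is equivalent to letting the summation run to infinity.
    pcount = Counter(min(a, 3) for a in extension_counts)
    discounted = sum(discount[i] * pcount[i] for i in range(1, 3 + 1))
    return discounted / sum(extension_counts)
```

With this version, the unnormalized probabilities of a prefix's extensions plus its backoff sum to exactly 1.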
I'm trying to make a Python script that computes the probabilities and backoffs similarly to kenLM. The goal is to reproduce the same outputs, given the same corpus. However, no matter how much I read the documentation and the paper, I can't get it to work... I would love some external help in order to get it to work and successfully reproduce the same result as kenLM.
I'm testing on a toy corpus. Here is the content of `test.txt`:

I can train a LM using kenLM with the following command:

```
lmplz --text test.txt --arpa test.lm -o 2
```
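The per-n-gram values can then be read straight out of the resulting ARPA file, e.g. with a small helper like this (a sketch; it assumes the tab-separated `log10 prob / n-gram / log10 backoff` columns that lmplz writes):

```python
def read_arpa(path):
    """Parse a small ARPA file into {order: {ngram: (log10_prob, log10_backoff)}}.
    The backoff defaults to 0.0 when the column is absent (highest order)."""
    tables, order = {}, None
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line.startswith("\\") and line.endswith("-grams:"):
                order = int(line[1:line.index("-")])   # e.g. "\2-grams:" -> 2
                tables[order] = {}
            elif line and order is not None and not line.startswith("\\"):
                fields = line.split("\t")
                prob, ngram = float(fields[0]), fields[1]
                backoff = float(fields[2]) if len(fields) > 2 else 0.0
                tables[order][ngram] = (prob, backoff)
    return tables
```

For example, `read_arpa("test.lm")[1]["went"]` gives the probability and backoff stored for the 1-gram `went`.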
Now, in the `test.lm` file, I can access the probability and backoff computed for each 1-gram and 2-gram. Here is my Python script to compute the same probabilities and backoffs:
I followed the formulas from this paper.
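Concretely, the quantities I'm trying to reproduce are (as I understand the paper's notation, with $a$ the adjusted count, $D_n$ the discounts, $u$ the pseudo-probability and $b$ the backoff):

$$u(w_n \mid w_1^{n-1}) = \frac{a(w_1^{n}) - D_n\bigl(a(w_1^{n})\bigr)}{\sum_x a(w_1^{n-1} x)}, \qquad p(w_n \mid w_1^{n-1}) = u(w_n \mid w_1^{n-1}) + b(w_1^{n-1})\, p(w_n \mid w_2^{n-1}),$$

with the recursion bottoming out at the unigrams, which are interpolated with the uniform distribution $1 / |\mathrm{vocab}|$.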
But after running this script, I get different probabilities and backoffs. For example, for the 1-gram `went`, kenLM gives `p=-0.6726411` and `backoff=-0.033240937`, while my script gives `p=-0.6292122373715772` and `backoff=-0.022929960646239318`. For the 2-gram `You go`, kenLM gives `p=-0.4305645`, while my script gives `p=-0.3960932540172504`.

What is the reason for such discrepancies?