kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Reproduce probabilities and backoffs from python #405

Open astariul opened 1 year ago

astariul commented 1 year ago

I'm trying to make a Python script that computes the probabilities and backoffs similarly to KenLM.

The goal is to reproduce the same outputs, given the same corpus.

However, no matter how much I read the documentation and the paper, I can't get it to work. I would love some external help to reproduce the same results as KenLM.


I'm testing on a toy corpus. Here is the content of test.txt:

I went to
I go to
You went to
You go to
Joe went to
Joe go to
Anna went ski
Anna went ski

I can train an LM using KenLM with the following command: lmplz --text test.txt --arpa test.lm -o 2

Now in the test.lm file, I can access the probabilities and backoffs computed for each 1-gram and 2-gram.
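
For comparison, the values can be loaded back from the ARPA file with a small helper like this (just a quick sketch of mine, not part of KenLM, assuming lmplz's tab-separated ARPA layout):

def read_arpa(path):
    """Read log10 probabilities and backoffs from an ARPA file into a dict."""
    entries = {}
    order = None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.endswith("-grams:"):          # section header, e.g. "\2-grams:"
                order = int(line[1:line.index("-")])
                continue
            if order is None or not line or line.startswith("\\"):
                continue                          # \data\ header, count lines, blanks, \end\
            fields = line.split("\t")
            ngram = tuple(fields[1].split(" "))
            backoff = float(fields[2]) if len(fields) > 2 else 0.0
            entries[ngram] = (float(fields[0]), backoff)
    return entries

# Example: read_arpa("test.lm")[("went",)] -> (log10 probability, log10 backoff)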


Here is my Python script to compute the probabilities and backoffs:

from collections import Counter
import math

from nltk.util import ngrams as compute_ngrams

def main(order=2):
    # Count N-grams
    c = Counter()
    with open("test.txt", "r") as f:
        for line in f:
            processed_line = line.strip().split(" ")
            processed_line = ["<s>"] + processed_line + ["</s>"]

            for ngram in compute_ngrams(processed_line, order):
                c[ngram] += 1

    # Adjust counts
    last_order = c
    a = Counter(c)
    for _ in range(1, order):
        new_order = Counter()
        for ngram, count in last_order.items():
            if count > 0:
                suffix = ngram[1:]
                new_order[suffix] += 1
        a.update(new_order)
        last_order = new_order

    # Make sure <s> is in there (although its count is 0)
    a[("<s>",)] = 0

    # Compute smoothing statistics
    t = [Counter() for i in range(order + 1)]
    for ngram, count in a.items():
        if 1 <= count <= 4:
            t[len(ngram)][count] += 1

    # Create discount function
    def discount(n, k):
        if k == 0:
            return 0

        if k >= 3:
            k = 3

        y = t[n][1] / (t[n][1] + 2 * t[n][2])
        return k - (k + 1) * y * t[n][k + 1] / t[n][k]

    # Just checking (https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc#L60)
    for n in range(1, order + 1):
        for i in range(0, 5):
            assert 0 <= discount(n, i) <= i, f"{discount(n, i)}"

    # Compute pseudo-probabilities and backoff
    u = {}
    b = {}
    for ngram, count in a.items():
        o = len(ngram)
        prefix = ngram[:-1]
        prefix_count = 0
        pcount = Counter()

        # One pass to compute the denominator + backoff terms
        for ngram2, count2 in a.items():
            # if (len(prefix) == 0 and len(ngram2) == 1) or ngram2[:-1] == prefix:
            if ngram2[:-1] == prefix: # and len(ngram2) == o:
                prefix_count += count2
                pcount[count2] += 1

        u[ngram] = (count - discount(o, count)) / prefix_count
        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    # Add empty backoff for maximum order and for </s>
    for ngram in u:
        if len(ngram) == order:
            b[ngram] = 0
    b[("</s>",)] = 0

    # Add unknown word
    u[("<unk>",)] = 0
    b[("<unk>",)] = 0

    # Interpolate to compute the final probabilities
    p = {}

    # First, unigrams
    vocab = [ngram for ngram in u if len(ngram) == 1]
    vocab_size = len(vocab)
    for ngram in vocab:
        p[ngram] = u[ngram] + b[tuple()] / vocab_size

    # Then, compute ngrams, order by order
    for i in range(1, order):
        o = i + 1

        for ngram in u:
            # Skip ngram of wrong order (either they were already computed, or will be computed later)
            if len(ngram) != o:
                continue

            prefix = ngram[:-1]
            suffix = ngram[1:]

            p[ngram] = u[ngram] + b[prefix] * p[suffix]

    # Display
    print("Ngrams probabilities & backoff (and pseudo probabilities) :\n")
    for ngram in p:
        print(f"{ngram} -> {math.log10(p[ngram])} --- {math.log10(b[ngram]) if b[ngram] > 0 else 0}")

if __name__ == "__main__":
    main()

I followed the formulas from this paper.

But after running this script, I get different probabilities and backoffs. For example, for the 1-gram went:

For the 2-gram You go:


What is the reason for such discrepancies?

MaigoAkisame commented 1 year ago

I get different results from both you and KenLM, but I believe KenLM is making a mistake here.

What I got with KenLM on your corpus:

-0.6726411      went    -0.022929981
-0.4034029      You go

What I got with my own Python script:

-0.614294447744974      went    -0.022929960646239318
-0.39343950744536116    You go

I believe KenLM is miscalculating the discounts for 1-grams. KenLM has: D1 = 0.4, D2 = 1.6, D3+ = 1.4. I have: D1 = 0.5555555555555556, D2 = 1.1666666666666665, D3+ = 0.7777777777777777.

And this is because KenLM miscounts the number of 1-grams with adjusted counts = 1 and 2. Using the same method as in Issue #427, I find that KenLM prints s.n[1] == 4 and s.n[2] == 3, i.e. KenLM thinks there are 4 1-grams with adjusted count = 1, and 3 1-grams with adjusted count = 2. But actually, there are 5 1-grams with adjusted count = 1 (I, You, Joe, Anna, ski) -- these all occur after only 1 type of token; and 2 1-grams with adjusted count = 2 (to and </s>) -- these occur after 2 types of tokens.
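
Here is a quick sketch (mine, not KenLM code) that recounts the 1-gram adjusted counts on your toy corpus and feeds the resulting count-of-counts into the same discount formula your script uses; it reproduces the discounts I quoted above:

from collections import Counter

corpus = [
    "I went to", "I go to", "You went to", "You go to",
    "Joe went to", "Joe go to", "Anna went ski", "Anna went ski",
]

# Adjusted count of a 1-gram w = number of distinct token types seen
# immediately before w (the Kneser-Ney continuation count); <s> never
# appears on the right-hand side, so it naturally gets no count.
left_contexts = {}
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        left_contexts.setdefault(word, set()).add(prev)

adjusted = {w: len(ctx) for w, ctx in left_contexts.items()}
# adjusted counts: I/You/Joe/Anna/ski = 1, to/</s> = 2, go = 3, went = 4

# Count-of-counts n[k] = number of 1-grams whose adjusted count is k.
n = Counter(adjusted.values())   # {1: 5, 2: 2, 3: 1, 4: 1}

# Same modified Kneser-Ney discount formula as in your script above.
y = n[1] / (n[1] + 2 * n[2])
for k in (1, 2, 3):
    print(f"D{k} =", k - (k + 1) * y * n[k + 1] / n[k])
# D1 = 0.5555..., D2 = 1.1666..., D3+ = 0.7777...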

MaigoAkisame commented 1 year ago

There are two other causes for the discrepancy:

  1. KenLM does not include <s> when calculating the vocabulary size, while your program does. I think KenLM's approach makes more sense -- <s> is never to be predicted, so we don't need to assign any probability to it (see the sketch after this list).

  2. When calculating the backoff, you're only summing the probability mass discounted from ngrams with adjusted count <= 3:

        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    While this is a faithful implementation of the second equation in Sec 3.3 of this paper, $b(w_1^{n-1}) = \dfrac{\sum_{i=1}^{3} D_n(i)\,\bigl|\{x : a(w_1^{n-1} x) = i\}\bigr|}{\sum_x a(w_1^{n-1} x)}$, I think the equation in the paper is wrong. The upper limit of the summation should be infinity. Otherwise, the probability mass discounted from n-grams with adjusted count > 3 goes nowhere, and the total of the unnormalized probabilities plus the backoff is less than 1. I think KenLM's implementation treats the upper limit as infinity.
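
For concreteness, here is a sketch of how both points could be folded into your script (my suggestion only, not a claim about what KenLM's code literally does):

# 1. Exclude <s> when spreading the unigram backoff mass over the vocabulary,
#    since <s> is never predicted.
vocab = [ngram for ngram in u if len(ngram) == 1 and ngram != ("<s>",)]
vocab_size = len(vocab)

# 2. Sum the discounted mass over all adjusted counts, not just 1..3.
#    Your discount() already clamps k >= 3 to the D3+ value, so passing the
#    raw count i picks the right bucket for counts above 3.
b[prefix] = sum(discount(o, i) * n_i for i, n_i in pcount.items()) / prefix_count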