kpu / kenlm

KenLM: Faster and Smaller Language Model Queries
http://kheafield.com/code/kenlm/

Reproduce probabilities and backoffs from python #405

Open astariul opened 1 year ago

astariul commented 1 year ago

I'm trying to make a Python script that computes the probabilities and backoffs similarly to KenLM.

The goal is to reproduce the same outputs, given the same corpus.

However, no matter how much I read the documentation and the paper, I can't get it to work. I would love some external help to reproduce the same results as KenLM.


I'm testing on a toy corpus. Here is the content of test.txt:

I went to
I go to
You went to
You go to
Joe went to
Joe go to
Anna went ski
Anna went ski

I can train an LM using KenLM with the following command: lmplz --text test.txt --arpa test.lm -o 2

Now in the test.lm file, I can access the probabilities and backoffs computed for each 1-gram and 2-gram.
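
For comparison, the values can be loaded back from the ARPA file with a small helper like this (just a quick sketch of mine, not part of KenLM, assuming lmplz's tab-separated ARPA layout):

def read_arpa(path):
    """Read log10 probabilities and backoffs from an ARPA file into a dict."""
    entries = {}
    order = None
    with open(path) as f:
        for line in f:
            line = line.rstrip("\n")
            if line.endswith("-grams:"):          # section header, e.g. "\2-grams:"
                order = int(line[1:line.index("-")])
                continue
            if order is None or not line or line.startswith("\\"):
                continue                          # \data\ header, count lines, blanks, \end\
            fields = line.split("\t")
            ngram = tuple(fields[1].split(" "))
            backoff = float(fields[2]) if len(fields) > 2 else 0.0
            entries[ngram] = (float(fields[0]), backoff)
    return entries

# Example: read_arpa("test.lm")[("went",)] -> (log10 probability, log10 backoff)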


Here is my Python script to compute the probabilities and backoffs:

from collections import Counter
import math

from nltk.util import ngrams as compute_ngrams

def main(order=2):
    # Count N-grams
    c = Counter()
    with open("test.txt", "r") as f:
        for line in f:
            processed_line = line.strip().split(" ")
            processed_line = ["<s>"] + processed_line + ["</s>"]

            for ngram in compute_ngrams(processed_line, order):
                c[ngram] += 1

    # Adjust counts
    last_order = c
    a = Counter(c)
    for _ in range(1, order):
        new_order = Counter()
        for ngram, count in last_order.items():
            if count > 0:
                suffix = ngram[1:]
                new_order[suffix] += 1
        a.update(new_order)
        last_order = new_order

    # Make sure <s> is in there (although its count is 0)
    a[("<s>",)] = 0

    # Compute smoothing statistics
    t = [Counter() for i in range(order + 1)]
    for ngram, count in a.items():
        if 1 <= count <= 4:
            t[len(ngram)][count] += 1

    # Create discount function
    def discount(n, k):
        if k == 0:
            return 0

        if k >= 3:
            k = 3

        y = t[n][1] / (t[n][1] + 2 * t[n][2])
        return k - (k + 1) * y * t[n][k + 1] / t[n][k]

    # Just checking (https://github.com/kpu/kenlm/blob/master/lm/builder/adjust_counts.cc#L60)
    for n in range(1, order + 1):
        for i in range(0, 5):
            assert 0 <= discount(n, i) <= i, f"{discount(n, i)}"

    # Compute pseudo-probabilities and backoff
    u = {}
    b = {}
    for ngram, count in a.items():
        o = len(ngram)
        prefix = ngram[:-1]
        prefix_count = 0
        pcount = Counter()

        # One pass to compute the denominator + backoff terms
        for ngram2, count2 in a.items():
            # if (len(prefix) == 0 and len(ngram2) == 1) or ngram2[:-1] == prefix:
            if ngram2[:-1] == prefix: # and len(ngram2) == o:
                prefix_count += count2
                pcount[count2] += 1

        u[ngram] = (count - discount(o, count)) / prefix_count
        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    # Add empty backoff for maximum order and for </s>
    for ngram in u:
        if len(ngram) == order:
            b[ngram] = 0
    b[("</s>",)] = 0

    # Add unknown word
    u[("<unk>",)] = 0
    b[("<unk>",)] = 0

    # Interpolate to compute the final probabilities
    p = {}

    # First, unigrams
    vocab = [ngram for ngram in u if len(ngram) == 1]
    vocab_size = len(vocab)
    for ngram in vocab:
        p[ngram] = u[ngram] + b[tuple()] / vocab_size

    # Then, compute ngrams, order by order
    for i in range(1, order):
        o = i + 1

        for ngram in u:
            # Skip ngram of wrong order (either they were already computed, or will be computed later)
            if len(ngram) != o:
                continue

            prefix = ngram[:-1]
            suffix = ngram[1:]

            p[ngram] = u[ngram] + b[prefix] * p[suffix]

    # Display
    print("Ngrams probabilities & backoff (and pseudo probabilities) :\n")
    for ngram in p:
        print(f"{ngram} -> {math.log10(p[ngram])} --- {math.log10(b[ngram]) if b[ngram] > 0 else 0}")

if __name__ == "__main__":
    main()

I followed the formulas from this paper.

But after running this script, I get different probabilities and backoffs. For example, for the 1-gram went:

For the 2-gram You go:


What is the reason for such discrepancies?

MaigoAkisame commented 1 year ago

I get different results from both you and KenLM, but I believe KenLM is making a mistake here.

What I got with KenLM on your corpus:

-0.6726411      went    -0.022929981
-0.4034029      You go

What I got with my own Python script:

-0.614294447744974      went    -0.022929960646239318
-0.39343950744536116    You go

I believe KenLM is miscalculating the discounts for 1-grams. KenLM has: D1 = 0.4, D2 = 1.6, D3+ = 1.4. I have: D1 = 0.5555555555555556, D2 = 1.1666666666666665, D3+ = 0.7777777777777777.

And this is because KenLM miscounts the number of 1-grams with adjusted counts = 1 and 2. Using the same method as in Issue #427, I find that KenLM prints s.n[1] == 4 and s.n[2] == 3, i.e. KenLM thinks there are 4 1-grams with adjusted count = 1, and 3 1-grams with adjusted count = 2. But actually, there are 5 1-grams with adjusted count = 1 (I, You, Joe, Anna, ski) -- these all occur after only 1 type of token; and 2 1-grams with adjusted count = 2 (to and </s>) -- these occur after 2 types of tokens.
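
Here is a quick sketch (mine, not KenLM code) that recounts the 1-gram adjusted counts on your toy corpus and feeds the resulting count-of-counts into the same discount formula your script uses; it reproduces the discounts I quoted above:

from collections import Counter

corpus = [
    "I went to", "I go to", "You went to", "You go to",
    "Joe went to", "Joe go to", "Anna went ski", "Anna went ski",
]

# Adjusted count of a 1-gram w = number of distinct token types seen
# immediately before w (the Kneser-Ney continuation count); <s> never
# appears on the right-hand side, so it naturally gets no count.
left_contexts = {}
for line in corpus:
    tokens = ["<s>"] + line.split() + ["</s>"]
    for prev, word in zip(tokens, tokens[1:]):
        left_contexts.setdefault(word, set()).add(prev)

adjusted = {w: len(ctx) for w, ctx in left_contexts.items()}
# adjusted counts: I/You/Joe/Anna/ski = 1, to/</s> = 2, go = 3, went = 4

# Count-of-counts n[k] = number of 1-grams whose adjusted count is k.
n = Counter(adjusted.values())   # {1: 5, 2: 2, 3: 1, 4: 1}

# Same modified Kneser-Ney discount formula as in your script above.
y = n[1] / (n[1] + 2 * n[2])
for k in (1, 2, 3):
    print(f"D{k} =", k - (k + 1) * y * n[k + 1] / n[k])
# D1 = 0.5555..., D2 = 1.1666..., D3+ = 0.7777...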

MaigoAkisame commented 1 year ago

There are two other causes for the discrepancy:

  1. KenLM does not include <s> when calculating the vocabulary size, while your program does. I think KenLM's approach makes more sense -- <s> is never to be predicted, so we don't need to assign any probability to it (see the sketch after this list).

  2. When calculating the backoff, you're only summing the probability mass discounted from ngrams with adjusted count <= 3:

        b[prefix] = sum(discount(o, i) * pcount[i] for i in range(1, 3 + 1)) / prefix_count

    While this is a faithful implementation of the second equation in Sec 3.3 of this paper, $b(w_1^{n-1}) = \dfrac{\sum_{i=1}^{3} D_n(i)\,\bigl|\{x : a(w_1^{n-1} x) = i\}\bigr|}{\sum_x a(w_1^{n-1} x)}$, I think the equation in the paper is wrong. The upper limit of the summation should be infinity. Otherwise, the probability mass discounted from n-grams with adjusted count > 3 goes nowhere, and the total of the unnormalized probabilities plus the backoff is less than 1. I think KenLM's implementation treats the upper limit as infinity.
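
For concreteness, here is a sketch of how both points could be folded into your script (my suggestion only, not a claim about what KenLM's code literally does):

# 1. Exclude <s> when spreading the unigram backoff mass over the vocabulary,
#    since <s> is never predicted.
vocab = [ngram for ngram in u if len(ngram) == 1 and ngram != ("<s>",)]
vocab_size = len(vocab)

# 2. Sum the discounted mass over all adjusted counts, not just 1..3.
#    Your discount() already clamps k >= 3 to the D3+ value, so passing the
#    raw count i picks the right bucket for counts above 3.
b[prefix] = sum(discount(o, i) * n_i for i, n_i in pcount.items()) / prefix_count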