Single Probability > 1 - Githubissues

As it turns our we are sometimes returning single probabilities greater one, which is obviously garbage. I did some calculations by hand and the error does not seem to be from implementation errors but theory ones. I used the formulas from Pickhardt (2014) and they seem to match the ones from Chen and Goodman (1999).

This example is the smallest the tests could produce. It is from simple absolute discounting on the abc corpus.

The counts were are going to need are: $c(a ~ b ~ a ~ b) = 1 \\ c(a ~ b ~ a ~ \_) = 1 \\ N_{1+}(* ~ b ~ a ~ b) = 1 \\ N_{1+}(* ~ b ~ a ~ *) = 1 \\ N_{1+}(a ~ b ~ a ~ *) = 1 \\ N_{1+}(* ~ a ~ b) = 3 \\ N_{1+}(* ~ a ~ *) = 5 \\ N_{1+}(b ~ a ~ *) = 2 \\ N_{1+}(* ~ b) = 1 \\ N_{1+}(* ~ *) = 5 \\ N_{1+}(a ~ *) = 2$

With these we can calculate probabilities of absolute discount with a discount value $D = 0.75$ :

$\hat P(b) = \frac{N_{1+}(* ~ b)}{N_{1+}(* ~ *)} = \frac{1}{5} = 0.2$ $\hat P(b ~ | ~ a) = \frac{N_{1+}(* ~ a ~ b) - D}{N_{1+}(* ~ a ~ *)} + \gamma_{med}(a) \cdot \hat P(b) = \frac{N_{1+}(* ~ a ~ b) - D}{N_{1+}(* ~ a ~ *)} + \frac{D \cdot N_{1+}(a *)}{N_{1+}(* ~ a ~ *)} \cdot \hat P(b) \\ = \frac{3 - 0.75}{5} + \frac{0.75 \cdot 2}{5} \cdot 0.2 = 0.51$ $\hat P(b ~ | ~ b ~ a) = \frac{N_{1+}(* ~ b ~ a ~ b) - D}{N_{1+}(* ~ b~ a ~ *)} + \gamma_{med}(b ~ a) \cdot \hat P(b ~ | ~ a) = \frac{N_{1+}(* ~ b ~ a ~ b) - D}{N_{1+}(* ~ b ~ a ~ *)} + \frac{D \cdot N_{1+}(b ~ a *)}{N_{1+}(* ~ b ~ a ~ *)} \cdot \hat P(b ~ | ~ a) \\ = \frac{1 - 0.75}{1} + \frac{0.75 \cdot 2}{1} \cdot 0.51 = 1.015$ $P(b ~ | ~ a ~ b ~ a) = \frac{c(a ~ b ~ a ~ b) - D}{c(a ~ b~ a ~ \_)} + \gamma_{high}(a ~ b ~ a) \cdot \hat P(b ~ | ~ b ~ a) = \frac{c(a ~ b ~ a ~ b) - D}{c(a ~ b ~ a ~ \_)} + \frac{D \cdot N_{1+}(a ~ b ~ a ~ *)}{c(a ~ b ~ a ~ \_)} \cdot \hat P(b ~ | ~ b ~ a) \\ = \frac{1 - 0.75}{1} + \frac{0.75 \cdot 1}{1} \cdot 1.015 = 1.01125$

I just noticed that sometimes the same formula images are shown multiple times. This is how it is supposed to look.

as discussed before it is strange that in MKN all the lower order probabilities are made off continutation counts. That is a rather strange choice. In particular it is not so clear when discounting continutation counts how this fits into the theory and in particular how to calculate the gammas. From the theory the gammas fix the mistake that was made by reducing the probablity.

In case of absolute discounting on a MLE we have (c(abab)-D)/c(aba) since we discounted our MLE we have to see how often we made a mistake of D/c(aba) this happens exaclty N{1+}(aba*) times so the correction term for the lower order probability will be N{1+}(aba*)D/c(aba_)

Now if we use contcounts in the lower order model like N{1+}(*bab)/N{1+}(ba) and discount this with D e.g. (N{1+}(*bab)-D)/N{1+}(ba) we have to see how often we made the error of D/N{1+}(ba) this will happen in exaclty N{1+}(ba) cases so our correction term would be DN{1+}(ba)/N{1+}(ba) so that we can cancle out N_{1+}(ba) and just getting the discount value as a correction factor. when using this approach in your example I arrive at a probability of: P(b|aba) = 0.8875 also we have made a choice here of switching probability spaces. we might additionally change the number of skips that we have at lower orders to not a single contskip in the beginning but to reflecting the total n-gram length

other than that I am not sure if my fix would work for all orders. Another way of testing this is to not discount lower orders but use a real interploation.

As discussed on the phone right now my "fix" is to radical in a way that cannot have two continuation skips in the numerator. comming from N1+(bab) looking at how many classes we could do a mistake we know it cannot be more than 3 (the skip at the beginning was already there) so we must not say the error was DN_{1+}(ba)/N_{1+}(ba) but rather DN_{1+}(_ba)/N_{1+}(ba) in which case we cannot cancle out terms on the fraction. As mentioned in our phone call and on skype the tests are summing up to one and are green. In this snse I will close the issue.

BUT: in the case of generalized language models it was always unclear how to calculate gammas and what kind of skips to use (countinuation skips or normal skips) I have the feeling that in the generalized case we should in the numerator use normal skipps everywhere but in the last position and in the donominator use continuation skips as before. As discussed orally the missing distinction yieled also a mistake in the current master branch of the gneralized language model toolkit

Just some clarification on the GLM case:

Calculating the probability P(b | a b a) for example require the lower order probability ^P(b | _ b a). For this lower order one has to use the fraction (N_1+(* * b a b) - D) / N_1+(* * b a *). While I have no theoretical argument, empirically fractions like (N_1+(* _ b a b) - D) / N_1+(* _ b a *) do not sum to one.

For this lower order probability one now has to calculate the gamma like: gamma_low(_ a b) = D * N_1+(_ _ b a *) / N_1+(* * b a *).

Should now be implemented now for all estimators.

I'm reopening this issue as it turns out that this fix doesn't apply to "mod-kneser-ney"-esque with three discount values D_1, D_2, D_3+. We have a theoretical fix for this using continuation counts of continuation counts ("double-continuation-counts"), but it has to be explored, defined and implemented.

renepickhardt / generalized-language-modeling-toolkit

Single Probability > 1 #99