lschmelzeisen opened 9 years ago
I did some testing with the `en0008t` corpus, using a vocab file with which I filtered `testing-samples-5.txt` to a vocab-filtered testing file. Querying it using `Cond5` resulted in only 2 zero-probabilities out of 48019 total sequences.

Accompanying log file, where I also enabled TRACE logging for the two zero-probabilities.
I believe I understand why certain sequences still get a zero-probability. For example, I will try to explain why `a member of the Alpha` results in a zero-probability, although all its tokens are seen in the corpus.
```
P(Alpha | a member of the) = α(a member of the Alpha) + γ(...) * P'(Alpha | member of the)
α(a member of the Alpha)   = c(a member of the Alpha) / c(a member of the _) = 0/19 = 0

P'(Alpha | member of the)  = α'(member of the Alpha) + γ'(...) * P'(Alpha | of the)
α'(member of the Alpha)    = N_1+(% member of the Alpha) / N_1+(% member of the %) = 0/28 = 0

P'(Alpha | of the)         = α'(of the Alpha) + γ'(...) * P'(Alpha | the)
α'(of the Alpha)           = N_1+(% of the Alpha) / N_1+(% of the %) = 0/1184 = 0

P'(Alpha | the)            = α'(the Alpha) + γ'(...) * P'(Alpha)
α'(the Alpha)              = N_1+(% the Alpha) / N_1+(% the %) = 0/6300 = 0

P'(Alpha)                  = α'(Alpha)
α'(Alpha)                  = N_1+(% Alpha) / N_1+(% %) = 0/86086 = 0
```
`_` is the skip with absolute counting, `%` is the skip with continuation counting. I omitted discounting in the formulas to keep them to the point, as it does not matter for this problem.
MKN uses continuation counts for the lower orders (P' resp. α'). However, in order to have a continuation count, a given sequence has to contain a skip, and we defined that we use the skip in front of the sequence for that purpose. As you can see, the numerator of every α' is zero, because `% Alpha` never occurs in training; but Alpha is still part of the vocab, because Alpha appears at the start of a line in our training file (and only there).
Indeed, one can see that this explanation is correct, because testing MKN with the absolute-counts lower order (estimator `MKNA`) results in no zero-probabilities.
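To illustrate the effect, here is a minimal Python sketch (a toy corpus, not the toolkit's actual counting code) of why a token that only ever appears at line start gets a continuation count of zero:

```python
from collections import defaultdict

# Toy corpus: "Alpha" occurs only at the start of a line,
# so it is never preceded by another token.
corpus = [
    "Alpha is a member of the group",
    "the group has a member",
]

# Continuation count N_1+(% w): number of distinct tokens
# observed immediately before w in training.
left_contexts = defaultdict(set)
for line in corpus:
    tokens = line.split()
    for prev, word in zip(tokens, tokens[1:]):
        left_contexts[word].add(prev)

def continuation_count(word):
    return len(left_contexts[word])

print(continuation_count("member"))  # 1  (only "a" precedes it)
print(continuation_count("Alpha"))   # 0  -> zero numerator in every α'
```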
First of all, I guess you mean Alpha appears at the start of a line in our training file, not our testing file.
I assumed that these cases existed, but I would not have assumed that they make up 80%. Also, I think I remember that Chen and Goodman talk about Laplace-smoothing the lowest order, which would exactly eliminate this problem. As we know, +1 smoothing is in general poor (which, by the way, is no surprise given the original Shannon paper, since we average with the uniform distribution), but as we also know, it does the trick of leaving no zeros.
What we could also do, in our own interpretation of MKN and to get rid of the +1 smoothing, is to remove first words from our vocab. By the way, it is interesting to see how `<s>` and `</s>` tags actually help here: if we add them to all sentences, we would not need the length distribution anymore, and we would not have the problem of a first word of a sentence lacking a continuation count.
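A minimal sketch of that idea (toy corpus and a hypothetical helper, not the toolkit's code): prepending `<s>` to every sentence gives sentence-initial words a left context, so their continuation count can no longer be zero:

```python
from collections import defaultdict

corpus = ["Alpha is a member", "Alpha leads the group"]

def continuation_counts(lines, add_sentence_tags):
    # Count distinct left contexts per word, optionally wrapping
    # each sentence in <s> ... </s> tags first.
    left = defaultdict(set)
    for line in lines:
        tokens = line.split()
        if add_sentence_tags:
            tokens = ["<s>"] + tokens + ["</s>"]
        for prev, word in zip(tokens, tokens[1:]):
            left[word].add(prev)
    return {w: len(s) for w, s in left.items()}

print(continuation_counts(corpus, False).get("Alpha", 0))  # 0
print(continuation_counts(corpus, True)["Alpha"])          # 1  (preceded by <s>)
```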
Yes, I meant at the start of training (fixed above).
I don't believe your 80% number (as I only got 2 zero-probabilities in 48019). I ask you to redo your test, making sure you only use words from training; there is a script `FilterForTraining` that can do this for you. If you can confirm your 80% number, I ask you to upload your training and testing files here.
I think excluding from the vocab all tokens that appeared only at line start is a good solution.
Can you explain how `<s>` and `</s>` tags would make our requirement for the length distribution obsolete?
Yes and no. I have empirical evidence but couldn't find a proof, and everyone I talked to has the same problem: everyone believes it works, no one is completely sure. The idea is to add the two tokens but only test on Σ*, where Σ does not contain the tokens. We should discuss this f2f in our next meeting.
As discussed before: in the GLM case I can understand that we have zero probabilities (though not in 80% of our cases), since we have fixed positions, and it could happen that a word occurring only once in the corpus just does not occur at the position of the skipped n-gram that we need to see for testing. But in the MKN case we should not have any zeros, since we back off to lower-order models for all conditional probabilities whose probability is zero. In the worst case, the chain rule for e.g. a 4-gram would give something like P(w1)P(w2)P(w3)P(w4) if all words w1, ..., w4 occur only once in training; in that case we would also have some backoff weights. Without having looked at the code specifics, I guess our mistake is that we optimized for correct testing (probabilities summing to 1) but not for a correct implementation of the algorithm.
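For what it's worth, the backoff chain from the derivation earlier in the thread can be sketched as follows (a simplified form P = α + γ · P' with hypothetical γ values, discounting omitted as before). It shows that nonzero backoff weights alone cannot rescue a word when every α and the lowest-order estimate are zero:

```python
def backoff_chain(alphas, gammas, p_lowest):
    # Evaluate P = alpha + gamma * P_lower from the lowest order upward.
    p = p_lowest
    for alpha, gamma in zip(reversed(alphas), reversed(gammas)):
        p = alpha + gamma * p
    return p

# Values from the "a member of the Alpha" example: every alpha is 0/...
alphas = [0/19, 0/28, 0/1184, 0/6300]
gammas = [0.5, 0.5, 0.5, 0.5]  # hypothetical nonzero backoff weights
print(backoff_chain(alphas, gammas, p_lowest=0/86086))  # 0.0
```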