renepickhardt / generalized-language-modeling-toolkit

Generalized Language Modeling toolkit
http://glm.rene-pickhardt.de

Why do we have so many zero probabilities with MKN and GLM for seen counts? #93

Open lschmelzeisen opened 9 years ago

renepickhardt commented 9 years ago

As discussed before: in the GLM case I can understand that we have 0 probabilities (though not in 80% of our cases), since we have fixed positions and it can happen that a word occurring only once in the corpus just does not occur at the position of the skipped n-gram that we need to see for testing. But in the MKN case we should not have any zeros, since we back off to lower-order models for all conditional probabilities whose probability is zero. So in the worst case, the chain rule for e.g. a 4-gram would give us something like P(w1)P(w2)P(w3)P(w4) if all words w1, ..., w4 occur only once in training, and in that case we would also pick up some backoff weights. Without having looked at the code specifics, I guess our mistake is that we optimized for correct testing (sum of probabilities = 1) but not for a correct implementation of the algorithm.
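To make the backoff argument concrete, here is a minimal sketch (my own illustration, not the toolkit's code) of an interpolated backoff recursion of the form P(w | h) = α(h, w) + γ(h) · P(w | shortened h). The names `alpha`, `gamma`, and `unigram` are assumed placeholders:

```python
# Hypothetical sketch of interpolated backoff (not the toolkit's implementation).
# P(w | h) = alpha(h, w) + gamma(h) * P(w | h without its first word)
# If unigram[w] > 0 for every vocab word and gamma(h) > 0 whenever mass was
# discounted, this recursion can never return 0 for a seen word.

def interpolated_prob(w, history, alpha, gamma, unigram):
    """history is a tuple of context words, longest context first."""
    if not history:                # lowest order: unigram estimate
        return unigram[w]
    return (alpha(history, w)      # discounted higher-order estimate, may be 0
            + gamma(history)       # backoff weight, positive after discounting
            * interpolated_prob(w, history[1:], alpha, gamma, unigram))
```

The result can only be zero if the lowest-order estimate itself assigns zero to w, which is exactly the failure mode the next comment traces.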

lschmelzeisen commented 9 years ago

I did some testing with the en0008t corpus: using a vocab file, I filtered testing-samples-5.txt down to a vocab-filtered testing file.

Querying with Cond5 resulted in only 2 zero-probabilities out of 48019 total sequences.

Accompanying log file, where I also enabled TRACE logging for the two zero-probabilities.

I believe I understand why certain sequences are still zero. As an example, I will try to explain why "a member of the Alpha" results in a zero-probability, although all of its tokens are seen in the corpus.

P(Alpha | a member of the) = α(a member of the Alpha) + γ(...) * P'(Alpha | member of the)
α(a member of the Alpha) = c(a member of the Alpha) / c(a member of the _) = 0/19 = 0

P'(Alpha | member of the) = α'(member of the Alpha) + γ'(...) * P'(Alpha | of the)
α'(member of the Alpha) = N_1+(% member of the Alpha) / N_1+(% member of the %) = 0/28 = 0

P'(Alpha | of the) = α'(of the Alpha) + γ'(...) * P'(Alpha | the)
α'(of the Alpha) = N_1+(% of the Alpha) / N_1+(% of the %) = 0/1184 = 0

P'(Alpha | the) = α'(the Alpha) + γ'(...) * P'(Alpha)
α'(the Alpha) = N_1+(% the Alpha) / N_1+(% the %) = 0/6300 = 0

P'(Alpha) = α'(Alpha)
α'(Alpha) = N_1+(% Alpha) / N_1+(% %) = 0/86086 = 0

_ is the Skip with absolute counting, % is the Skip with continuation counting. I omitted discounting from the formulas to keep them to the point, as it does not matter for this problem.

MKN uses continuation counts for the lower orders (P' resp. α'). However, in order to have a continuation count for a given sequence, it has to contain a Skip; we defined that we use the Skip in front of the sequence for that purpose. As you can see, the numerator of every α' is zero because % Alpha never occurs in training, yet Alpha is still part of the vocab, because Alpha appears at the start of a line in our training file (and only there).
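To illustrate, here is a small sketch (assumed helper code, not how the toolkit actually stores its counts) of the lower-order continuation count N_1+(% w), i.e. the number of distinct word types seen immediately before w. A word that only ever occurs as the first token of a line has no left context in the training data, so its count is 0 even though it is in the vocab:

```python
from collections import defaultdict

def continuation_counts(lines):
    """N_1+(% w): number of distinct word types seen immediately before w."""
    left_contexts = defaultdict(set)
    for line in lines:
        tokens = line.split()
        for prev, cur in zip(tokens, tokens[1:]):
            left_contexts[cur].add(prev)
    return {w: len(ctx) for w, ctx in left_contexts.items()}

corpus = [
    "Alpha joined the club",          # 'Alpha' occurs only line-initially
    "he is a member of the club",
]
counts = continuation_counts(corpus)
print(counts.get("Alpha", 0))  # -> 0: no left context, so every alpha' numerator is 0
print(counts["the"])           # -> 2: preceded by 'joined' and 'of'
```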

Indeed, one can see that this explanation holds: testing with MKN using absolute counts for the lower orders (Estimator: MKNA) results in no zero-probabilities.

renepickhardt commented 9 years ago

First of all, I guess you mean Alpha appears at the start of a line in our training file instead of the testing file. I assumed that these cases existed, but I would not have assumed that they make up 80%. Also, I think I remember that Chen and Goodman talk about Laplace smoothing the lowest order, which would exactly extinguish this problem. As we know, +1 smoothing is in general poor (which, by the way, is no surprise according to the original Shannon paper, since we build the average with the uniform distribution), but as we also know, it does the trick of leaving no zeros. What we could also do in our own interpretation of MKN, to get rid of the +1 smoothing, is to cancel out first words from our vocab. By the way, it is interesting to see how <s> and </s> tags would actually help here: if we included them in all sentences we would not need the length distribution anymore, and we would not have the problem of missing a continuation count for the first word of a sentence.
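For reference, a hedged sketch of the +1 (Laplace) fix for the lowest order mentioned above: smooth the unigram continuation distribution so that no vocab word gets probability 0, at the cost of averaging with the uniform distribution. `continuation_counts` is the hypothetical helper from the sketch above; this is not the toolkit's estimator.

```python
def laplace_lowest_order(cont_counts, vocab):
    """P'(w) = (N_1+(% w) + 1) / (N_1+(% %) + |V|): strictly positive for every w."""
    n1p_dot_dot = sum(cont_counts.get(w, 0) for w in vocab)   # N_1+(% %)
    denom = n1p_dot_dot + len(vocab)
    return {w: (cont_counts.get(w, 0) + 1) / denom for w in vocab}
```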

lschmelzeisen commented 9 years ago

Yes, I meant at the start of the training file (fixed above).

I don't believe your 80% number (as I only got 2 zero-probabilities out of 48019). I ask you to redo your test, making sure you only use words from training; there is a script FilterForTraining that can do this for you. If you can confirm your 80% number, I ask you to upload your training and testing files here.

I think excluding from the vocab all tokens that appear at line start is a good solution.
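A quick sketch of that proposal (my own code, names assumed, not part of the toolkit): drop the problematic tokens before building the vocab. A narrower variant would drop only tokens that never occur anywhere but at line start, since those are exactly the ones with N_1+(% w) = 0.

```python
def filter_vocab(lines, only_if_never_elsewhere=False):
    """Vocab without the line-start tokens that cause zero continuation counts."""
    first, later = set(), set()
    for line in lines:
        tokens = line.split()
        if tokens:
            first.add(tokens[0])
            later.update(tokens[1:])
    if only_if_never_elsewhere:
        return later            # narrower: drop only tokens never seen elsewhere
    return later - first        # proposal above: drop every line-initial token
```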

lschmelzeisen commented 9 years ago

Can you explain how <s> and </s> tags would obsolete our requirement for a length distribution?

renepickhardt commented 9 years ago

Yes and no. I have empirical evidence but couldn't find a proof; everyone I talked to has the same problem: everyone believes it works, but no one is completely sure. The idea is to add the two tokens but only test on Σ*, where Σ does not contain the two tokens. We should discuss this f2f in our next meeting.
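As a toy illustration of the <s>/</s> point (my own sketch, not a proof, and not tied to the toolkit): once every context assigns some probability to </s>, the model already spreads its mass over sentences of every length, and that mass sums to 1 on its own, so no separate length distribution is required. The bigram numbers below are made up.

```python
from itertools import product

VOCAB = ["a", "b"]
# Made-up toy bigram model P(next | prev); every row sums to 1 including </s>.
BIGRAM = {
    "<s>": {"a": 0.6, "b": 0.4},
    "a":   {"a": 0.2, "b": 0.3, "</s>": 0.5},
    "b":   {"a": 0.3, "b": 0.2, "</s>": 0.5},
}

def sentence_prob(words):
    prob, prev = 1.0, "<s>"
    for w in list(words) + ["</s>"]:
        prob *= BIGRAM[prev].get(w, 0.0)
        prev = w
    return prob

# Summing over all sentences up to length 14 already gives ~0.99994: the mass
# over sentence lengths converges to 1 without any extra length distribution.
total = sum(sentence_prob(s) for n in range(1, 15) for s in product(VOCAB, repeat=n))
print(total)
```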