senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

Perplexity in TheanoLM vs SRILM #13

Closed sameerkhurana10 closed 6 years ago

sameerkhurana10 commented 8 years ago

Hi,

The SRILM n-gram modelling toolkit gives me two perplexities on my test set. The formulas it uses to calculate them are:

ppl gives the geometric average of 1/probability of each token, i.e., perplexity. The exact expression is:

ppl = 10^(-logprob / (words - OOVs + sentences))

ppl1 gives the average perplexity per word excluding the </s> tokens. The exact expression is:

ppl1 = 10^(-logprob / (words - OOVs))
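
For reference, the two expressions can be written out as a small Python helper (my own sketch, not SRILM code; logprob is the total base-10 log probability that SRILM reports):

def srilm_ppl(logprob, words, oovs, sentences):
    # ppl: the denominator also counts the </s> token of every sentence
    return 10 ** (-logprob / (words - oovs + sentences))

def srilm_ppl1(logprob, words, oovs):
    # ppl1: per-word perplexity, </s> tokens excluded
    return 10 ** (-logprob / (words - oovs))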

I wanted to know how to compare the perplexity I get from TheanoLM with what I get from the SRILM n-gram language model. One thing I can think of is dividing the ppl from SRILM by 10. What else? I just want to compare my LSTM LM built using TheanoLM against the n-gram model built using SRILM.

Thank you

senarvi commented 8 years ago

First of all, make sure both models have exactly the same vocabulary; otherwise the perplexities are not comparable. SRILM ignores OOV words when computing the perplexity. You get the same behaviour from TheanoLM if you use --unk-penalty 0. If you have a high percentage of OOV words, I recommend using that during training as well. Otherwise the training may be guided towards a network that just gives a high probability to the OOV token: the perplexity gets low, but it won't be a good model.

TheanoLM includes sentence end tokens in the perplexity computation, so you should look at the ppl value from SRILM output.

Perplexity does not depend on the logarithm base. SRILM uses base-10 logprobs, which is why a base-10 exponent is taken in the above formula. TheanoLM uses the natural logarithm, but you only need to care about that if you're comparing logprobs. Then you should use --log-base 10, e.g.
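
That base independence is easy to check with a quick sketch (toy probabilities below, not output from either tool): averaging base-10 logprobs and exponentiating with 10 gives the same perplexity as averaging natural logprobs and exponentiating with e.

import math

probs = [0.01, 0.2, 0.05, 0.3]  # toy per-word probabilities

logprob10 = sum(math.log10(p) for p in probs)
logprob_e = sum(math.log(p) for p in probs)

ppl_base10 = 10 ** (-logprob10 / len(probs))
ppl_base_e = math.exp(-logprob_e / len(probs))

assert abs(ppl_base10 - ppl_base_e) < 1e-6  # both ~24.0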

theanolm score model.h5 test.txt --output word-scores --log-base 10

Use --log-base 10 also if you're rescoring an n-best list generated using SRILM!

sameerkhurana10 commented 8 years ago

Thanks for the reply.

I did some calculation as suggested by you.

ngram -debug 2 -order 4 -lm mgb.o4g.kn.gz -vocab /data/sls/qcri-scratch/sameer/language_modelling/theanoLM/data/rnnlm_data/input.vocab -limit-vocab -ppl /data/sls/qcri-scratch/sameer/language_modelling/theanoLM/test.txt

The vocab argument given here is the vocab that I used to train the LSTM language model.

Perplexity is given by:

5002 sentences, 60169 words, 8754 OOVs
0 zeroprobs, logprob= -156679 ppl= 598.63 ppl1= 1115.17
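
(As a sanity check, not part of the SRILM output: plugging these counts into the ppl formulas from the first post reproduces the reported values, up to rounding of the logprob.)

logprob, words, oovs, sentences = -156679, 60169, 8754, 5002
ppl = 10 ** (-logprob / (words - oovs + sentences))   # ~598.6  (reported 598.63)
ppl1 = 10 ** (-logprob / (words - oovs))              # ~1115.2 (reported 1115.17)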

TheanoLM training command:

theanolm train model.h5 validation-data.txt --training-set training-data.txt --vocabulary input.vocab --architecture lstm300

Perplexity calculation command:

theanolm score model.h5 test.txt --output word-scores --log-base 10

Perplexity is given by:

Number of sentences: 5002
Number of words: 70173
Number of predicted probabilities: 65171
Cross entropy (base e): 5.615544113890271
Cross entropy (base 10): 2.4387998215468305
Perplexity: 274.6627863816429
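
(Again a consistency check, not part of the TheanoLM output: the perplexity is simply the exponential of the cross entropy in the matching base, and the number of predicted probabilities is the word count minus the 5002 sentence starts.)

import math

words, sentences = 70173, 5002
cross_entropy_e = 5.615544113890271

predicted = words - sentences                         # 65171 predicted probabilities
ppl_from_e = math.exp(cross_entropy_e)                # ~274.66
ppl_from_10 = 10 ** (cross_entropy_e / math.log(10))  # same value via base 10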

The question is: do you think this is the right way to compare the two models? If it is, I find it hard to believe that a 4-gram language model trained on 130M tokens gives a higher ppl than an LSTM model trained on just 4M tokens.

senarvi commented 8 years ago

The difference in perplexity is so big that I agree with you: there has to be a mistake somewhere. Also, TheanoLM reports 70173 words and 70173 - 5002 = 65171 predicted probabilities (sentence starts excluded), while SRILM reports 60169 words.

You're using --limit-vocab with SRILM, meaning that <unk> is discarded from the language model. You should use --unk-penalty 0 with TheanoLM, so that it too excludes <unk> from the perplexity computation.

Both tools should compute the probability of the same words, so the number of predicted probabilities should match. To debug what's going on, I would start from a smaller test set (maybe just one sentence that also includes OOV words) and compute word scores with both tools (--output word-scores with TheanoLM and -ppl test.txt -debug 2 with SRILM). Then you can see if the numbers are in the same ballpark.

sameerkhurana10 commented 8 years ago

Thank you for the reply and sorry to bother you so much with this.

So, I calculated the score again with --unk-penalty 0 and now the ppl with LSTM is:

Number of sentences: 5002
Number of words: 70173
Number of predicted probabilities: 56304
Cross entropy (base e): 6.153225547136662
Cross entropy (base 10): 2.6723119010275695
Perplexity: 470.23169773500837 (up from 270)

Looking at one sentence with an OOV word.

SRILM outputs:

msAr Al<SlAH AlsyAsy fy Almgrb ElY mHk AlAntxAbAt AlmHlyp
        p( msAr | <s> )         = [2gram] 8.29459e-05 [ -4.0812 ]
        p( Al<SlAH | msAr ...)  = [2gram] 0.00205024 [ -2.6882 ]
        p( AlsyAsy | Al<SlAH ...)       = [3gram] 0.166815 [ -0.777764 ]
        p( fy | AlsyAsy ...)    = [3gram] 0.0872803 [ -1.05908 ]
        p( Almgrb | fy ...)     = [3gram] 0.00236401 [ -2.62635 ]
        p( ElY | Almgrb ...)    = [3gram] 0.0145742 [ -1.83641 ]
        p( <unk> | ElY ...)     = [OOV] 0 [ -inf ]
        p( AlAntxAbAt | <unk> ...)      = [1gram] 0.000176341 [ -3.75365 ]
        p( AlmHlyp | AlAntxAbAt ...)    = [2gram] 0.00657387 [ -2.18218 ]
        p( </s> | AlmHlyp ...)  = [3gram] 0.0397053 [ -1.40115 ]
1 sentences, 9 words, 1 OOVs
0 zeroprobs, logprob= -20.406 ppl= 185.068 ppl1= 355.426
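
(For this single sentence the same formulas check out: 9 words, 1 OOV, 1 sentence, so 9 predicted probabilities including </s> for ppl and 8 for ppl1.)

logprob, words, oovs, sentences = -20.406, 9, 1, 1
ppl = 10 ** (-logprob / (words - oovs + sentences))  # ~185.07 (reported 185.068)
ppl1 = 10 ** (-logprob / (words - oovs))             # ~355.43 (reported 355.426)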

TheanoLM outputs:

Sentence 4
log(p(msAr | <s>)) = -4.164788059005853
log(p(Al<SlAH | msAr, <s>)) = -3.755872594620744
log(p(AlsyAsy | Al<SlAH, msAr, <s>)) = -1.000063589824954
log(p(fy | AlsyAsy, Al<SlAH, msAr, ...)) = -1.05446284870925
log(p(Almgrb | fy, AlsyAsy, Al<SlAH, ...)) = -2.314540428851164
log(p(ElY | Almgrb, fy, AlsyAsy, ...)) = -1.9833464453695198
p(<unk> | ElY, Almgrb, fy, ...) is not predicted
log(p(AlAntxAbAt | <unk>, ElY, Almgrb, ...)) = -2.7087249134714755
log(p(AlmHlyp | AlAntxAbAt, <unk>, ElY, ...)) = -3.249091188784179
log(p(</s> | AlmHlyp, AlAntxAbAt, <unk>, ...)) = -0.7922125695889735
Sentence perplexity: 10.338763658658356

I must also say that I have been using the 4-gram model, trained on the full text that I have, for lattice rescoring, and I have been getting very good results in terms of WER reduction, so I have little reason to believe that something went wrong when I was building the 4-gram language model. My guess is that something went wrong while I was building the LSTM language model.

Any suggestions or comments?

senarvi commented 8 years ago

I spotted one bug: there was an unnecessary base conversion when computing the sentence perplexity. I fixed it already, but I didn't have time to test it. So in the above "word-scores" output, 10.33876... is incorrect. If you update to the latest version in Git, you should get 216.72..., which is a bit worse than the 185.068 given by SRILM. That bug shouldn't affect the overall perplexity computation, though.
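
(The corrected sentence perplexity can be reproduced from the per-word base-10 scores above: nine probabilities are predicted, since <unk> is excluded.)

log10_scores = [
    -4.164788059005853, -3.755872594620744, -1.000063589824954,
    -1.05446284870925, -2.314540428851164, -1.9833464453695198,
    -2.7087249134714755, -3.249091188784179, -0.7922125695889735,
]
sentence_ppl = 10 ** (-sum(log10_scores) / len(log10_scores))  # ~216.7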

The probabilities that you get from SRILM and TheanoLM are similar, so I'm confident that they're computed correctly. The only suspicious thing I still see is that the tools report a different number of words:

TheanoLM: 70173 tokens - 5002 sentence starts - 8867 OOVs = 56304 tokens
SRILM: 60169 words, 8754 OOVs

Are you sure you gave both tools the same test data? If you can't find the cause of that mismatch, you can start from one sentence and grow the test data until you find a sentence for which TheanoLM and SRILM report different word or OOV counts.

One more thing: You have a relatively high percentage of OOVs and they have a large effect on the perplexity. I suggest either excluding OOVs during training (--unk-penalty 0) or giving a constant score (e.g. --unk-penalty -5). Otherwise the OOVs may skew the training process.

sameerkhurana10 commented 8 years ago

Thanks. Yes, I gave them both the same test data. I will try to find the cause of the mismatch.

I have now started training with --unk-penalty 0.

Hope to see some good results.

pranjaldaga commented 8 years ago

Hi @sameerkhurana10, quick question: you mentioned that the command you used to train on 4M tokens was:

theanolm train model.h5 validation-data.txt --training-set training-data.txt --vocabulary input.vocab --architecture lstm300

Where did you generate "input.vocab" from?

I am trying to run a simple experiment as follows:

theanolm train model.h5 validation-data.txt --training-set training-data.txt

but I always encounter this exception:

IncompatibleStateError: Vocabulary parameter 'words' is missing from neural network state.

Any thoughts @senarvi ? Thanks!

senarvi commented 8 years ago

@pranjaldaga: It sounds like the file model.h5 exists already, but is invalid. If the file exists, the program tries to continue training from the previous state. Try deleting it first.

sameerkhurana10 commented 6 years ago

Back again.

I am still seeing the same behaviour. Here is a preliminary analysis.

I trained an n-gram language model using SRILM. Command:

ngram-count -text train.dat -order 3 -unk -map-unk "<UNK>" -kndiscount -interpolate -lm o3g.kn.gz

ppl command:

ngram -unk -lm o3g.kn.gz -ppl test-data.txt

ppl: 1063.3, OOV: 1928

I trained two language models with TheanoLM, one with --vocab and the other without.

Without --vocab (I believe the vocab is derived automatically by TheanoLM from the training data).

score command:

theanolm score model.h5 test-data.txt --output perplexity --exclude-unk

ppl: 2393.5 (believable compared to the n-gram). The model is trained using hsoftmax.

With the --vocab option. The model is word-blstm256-softmax. The vocab size is restricted to the 40k most frequent words extracted from the training corpus.

score command:

theanolm score model.h5 test-data.txt --output perplexity --exclude-unk

ppl: 201 (hard to believe), OOV: 5709

Any comments?

senarvi commented 6 years ago

The vocabulary in your n-gram model is so large (all training words?) that there are only 1928 OOV test words, so the perplexity is higher. In the last case you have 5709 OOV test words, leaving all the low-probability words out of the perplexity computation. It's not hard to believe that the perplexity is 201 with a small vocabulary. Do not compare perplexities unless you're using exactly the same vocabulary; then you should get exactly the same OOVs too.

sameerkhurana10 commented 6 years ago

Okay, true.