senarvi / theanolm

TheanoLM is a recurrent neural network language modeling tool implemented using Theano
Apache License 2.0

Confusing result in perplexity score #24

Closed kharazi closed 7 years ago

kharazi commented 7 years ago

I'm trying to train a language model using the Tiny-Shakespeare dataset. It seems that the language model is correctly trained, and I got the following output for this file:

 I am a glad.
 I am a glads.
 I am glad to meet you.
 I are is.

As you can see, the word glad appears in the dataset, but while scoring the given text the text scorer labels it as <unk> (I tested this with different words). Why did this happen?

# Sentence 1
log(p(I | <s>)) = -3.4169204235076904
log(p(am | <s> I)) = -2.3558216094970703
log(p(a | <s> I am)) = -1.5888748168945312
log(p(<unk> | ... I am a)) = -0.14008308947086334
log(p(</s> | ... am a <unk>)) = -4.105915069580078
Sentence perplexity: 10.191183683663455

# Sentence 2
log(p(I | <s>)) = -3.4169204235076904
log(p(am | <s> I)) = -2.3558216094970703
log(p(a | <s> I am)) = -1.5888748168945312
log(p(<unk> | ... I am a)) = -0.14008308947086334
log(p(</s> | ... am a <unk>)) = -4.105915069580078
Sentence perplexity: 10.191183683663455

# Sentence 3
log(p(I | <s>)) = -3.4169204235076904
log(p(am | <s> I)) = -2.3558216094970703
log(p(<unk> | <s> I am)) = -0.883741021156311
log(p(to | ... I am <unk>)) = -1.9496990442276
log(p(<unk> | ... am <unk> to)) = -0.3969348669052124
log(p(<unk> | ... <unk> to <unk>)) = -1.1432193517684937
log(p(</s> | ... to <unk> <unk>)) = -2.1247739791870117
Sentence perplexity: 5.771983351818038

# Sentence 4
log(p(I | <s>)) = -3.4169204235076904
log(p(are | <s> I)) = -7.457761764526367
log(p(is. | <s> I are)) = -13.031048774719238
log(p(</s> | ... I are is.)) = -0.31377649307250977
Sentence perplexity: 426.18642331347115

Number of sentences: 4
Number of words: 25
Number of out-of-vocabulary words: 5
Number of predicted probabilities: 21
Cross entropy (base e): 2.8431356080940793
Perplexity: 17.169518099543954
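As a sanity check on these numbers: the per-sentence perplexity reported above is just the exponential of the negative mean log probability. A minimal sketch, using the Sentence 1 values copied from the output (this recomputes the number, it is not part of TheanoLM itself):

```python
import math

# Log probabilities of Sentence 1, copied from the scorer output above.
log_probs = [
    -3.4169204235076904,   # p(I | <s>)
    -2.3558216094970703,   # p(am | <s> I)
    -1.5888748168945312,   # p(a | <s> I am)
    -0.14008308947086334,  # p(<unk> | ... I am a)
    -4.105915069580078,    # p(</s> | ... am a <unk>)
]

# Sentence perplexity = exp of the negative mean log probability.
perplexity = math.exp(-sum(log_probs) / len(log_probs))
print(perplexity)  # ~10.1912, matching "Sentence perplexity" above
```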
senarvi commented 7 years ago

Have you done any preprocessing for the data? Usually the punctuation is deleted or separated from the words. For example, you might want to replace

Go, get you home, you fragments!

with

Go , get you home , you fragments !

In your first example sentence you have the word glad. instead of glad, and glad. does not appear in the training data. You might also want to convert all words to lower case, since you have such a small data set.
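The kind of preprocessing described above could be sketched like this (a hypothetical helper, not something TheanoLM does for you; the exact punctuation set is an assumption):

```python
import re

def preprocess(line):
    """Separate punctuation from words and lowercase everything."""
    # Put spaces around punctuation so "glad." becomes "glad ."
    line = re.sub(r"([,.!?;:])", r" \1 ", line)
    # Collapse runs of whitespace and lowercase the result.
    return " ".join(line.lower().split())

print(preprocess("Go, get you home, you fragments!"))
# go , get you home , you fragments !
```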

In the third sentence you have two words, glad and meet, that actually should be in the training data. Have you provided any vocabulary for the train command? What is the command you used to train the model?

kharazi commented 7 years ago

I didn't do any preprocessing on this dataset, but I think that shouldn't cause the problem. For example, when I test it with a phrase that actually is in the dataset, like glad to see (4 occurrences in the dataset), I get this result:

log(p(I | <s>)) = -3.4169204235076904
log(p(am | <s> I)) = -2.3558216094970703
log(p(<unk> | <s> I am)) = -0.883741021156311
log(p(to | ... I am <unk>)) = -1.9496990442276
log(p(see | ... am <unk> to)) = -4.616444110870361
log(p(your | ... <unk> to see)) = -2.7698867321014404
log(p(<unk> | ... to see your)) = -0.06693568825721741
log(p(</s> | ... see your <unk>)) = -1.7577476501464844
Sentence perplexity: 9.273394881980112

My train command is theanolm train model.h5 with no extra options. Is there any way to check the vocabulary?

senarvi commented 7 years ago

You should also give a training file, and probably also a validation file (for determining when to stop training). Another option is to train a fixed number of epochs, e.g.:

theanolm train model.h5 --training-set input.txt --max-epochs 5

I trained 5 epochs using the above command, then wrote "glad to see" into test.txt, and computed word probabilities using

theanolm score model.h5 test.txt --output word-scores

I got:

Using gpu device 0: Quadro K2000 (CNMeM is disabled, cuDNN not available)
Reading vocabulary from network state.
Number of words in vocabulary: 25673
Number of word classes: 25673
Building neural network.
Restoring neural network state.
Building text scorer.
Scoring text.
# Sentence 1
log(p(glad | <s>)) = -11.732762336730957
log(p(to | <s> glad)) = -3.53452730178833
log(p(see | <s> glad to)) = -4.968919277191162
log(p(</s> | ... glad to see)) = -7.588036060333252
Sentence perplexity: 1049.4917139680845

Number of sentences: 1
Number of words: 5
Number of tokens: 5
Number of predicted probabilities: 4
Number of excluded (OOV) words: 0
Cross entropy (base e): 6.956061244010925
Perplexity: 1049.4917139680845

One thing to note: you can get unexpected results if you forget that model.h5 already exists from a previous training run, because TheanoLM tries to continue training from that existing model.

There's actually a way to look at the model file using h5dump. You can print the contents using h5dump --contents model.h5 and display the vocabulary using h5dump --dataset=/vocabulary/words model.h5.
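If you prefer Python over h5dump, the same dataset path can be read with h5py. A minimal sketch, assuming h5py and numpy are installed; demo_model.h5 is a stand-in built here just so the example is self-contained (with a real model you would open model.h5 directly):

```python
import h5py
import numpy as np

# Build a tiny stand-in file with the same /vocabulary/words layout
# that h5dump shows for a real TheanoLM model.
with h5py.File("demo_model.h5", "w") as f:
    words = np.array([b"<s>", b"</s>", b"<unk>", b"glad", b"to", b"see"])
    f.create_dataset("/vocabulary/words", data=words)

# Read the vocabulary back and check whether a word is in it.
with h5py.File("demo_model.h5", "r") as f:
    vocab = {w.decode("utf-8") for w in f["/vocabulary/words"][()]}

print("glad" in vocab)
```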

kharazi commented 7 years ago

It seems that there was a problem while training. Thanks for your help!