ottokart / punctuator2

A bidirectional recurrent neural network model with attention mechanism for restoring missing punctuation in unsegmented text
http://bark.phon.ioc.ee/punctuator
MIT License

nan values while training #46

Closed anavc94 closed 5 years ago

anavc94 commented 5 years ago

Hello again,

I have a question about the training process. Training runs successfully until NaN values of PPL appear. In my case they first show up in epoch 47, and from that epoch on every PPL and validation PPL value is NaN.

Here are some lines of the training output (I've printed some additional values):

```
Epoch = 46
Training with learning rate = 0.02
Total neg log lik = 1292.63056993 PPL = exp(0.00103047717629) PPL: 1.0010; Speed: 12549.65 sps
Total neg log lik = 2790.23252964 PPL = exp(0.00111217814479) PPL: 1.0011; Speed: 15146.28 sps
Total neg log lik = 4349.67130947 PPL = exp(0.00115584377909) PPL: 1.0012; Speed: 16240.85 sps
Total neg log lik = 6142.70772409 PPL = exp(0.00122423224731) PPL: 1.0012; Speed: 16856.70 sps
Total neg log lik = 7894.18972039 PPL = exp(0.00125863994266) PPL: 1.0013; Speed: 17253.00 sps
Total neg log lik = 9953.48264909 PPL = exp(0.00132247590469) PPL: 1.0013; Speed: 17521.59 sps
Total neg log lik = 12641.4703844 PPL = exp(0.00143967182766) PPL: 1.0014; Speed: 17718.59 sps
Total neg log lik = 14958.8259408 PPL = exp(0.00149063555692) PPL: 1.0015; Speed: 17859.47 sps
Total number of training labels: 10436608
Net saved
Total number of validation labels: 14513408
Validation perplexity is 6.6061

Epoch = 47
Training with learning rate = 0.02
Total neg log lik = 1606.82540393 PPL = exp(0.00128095137431) PPL: 1.0013; Speed: 12486.81 sps
Total neg log lik = 3401.88553357 PPL = exp(0.00135598115975) PPL: 1.0014; Speed: 15087.08 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 16219.43 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 16852.34 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17255.23 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17535.15 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17743.79 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17898.87 sps
Total number of training labels: 10436608
Net saved
Total number of validation labels: 14513408
Validation perplexity is nan

Epoch = 48
Training with learning rate = 0.02
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 12501.37 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 15115.35 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 16232.72 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 16846.52 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17248.92 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17529.18 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17734.80 sps
Total neg log lik = nan PPL = exp(nan) PPL: nan; Speed: 17893.42 sps
Total number of training labels: 10436608
Net saved
Total number of validation labels: 14513408
Validation perplexity is nan
```
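To clarify the extra values I print: the PPL column is exp of the total negative log-likelihood divided by the number of labels processed so far. For the first print of epoch 46 that works out as follows (the label count is inferred from the printed ratio, so treat it as an estimate):

```python
import math

# First print of epoch 46 above; 1254400 is the number of labels
# processed up to that point (inferred from 1292.63... / 0.00103...).
total_neg_log_lik = 1292.63056993
labels_seen = 1254400

avg_nll = total_neg_log_lik / labels_seen    # 0.00103047717629
print("PPL = exp(%s)" % avg_nll)
print("PPL: %.4f" % math.exp(avg_nll))       # 1.0010
```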

I don't understand what is happening. Am I doing something wrong? How can I fix this?
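For now, one thing I'm considering is a crude NaN guard around the training loop, so a run at least stops at the first bad minibatch instead of saving poisoned weights. This is only a sketch: `train_batch` is a hypothetical stand-in, not a real punctuator2 function, and the costs are made-up values mimicking epoch 47 above.

```python
import math

def train_batch(cost):
    # Hypothetical stand-in for the per-minibatch training call in main.py;
    # here it just passes the cost through so the guard can be demonstrated.
    return cost

# Costs from a run that goes bad mid-epoch, like epoch 47 above.
costs = [0.00128, 0.00135, float("nan"), 0.00140]

for i, c in enumerate(costs):
    cost = train_batch(c)
    if math.isnan(cost) or math.isinf(cost):
        # Abort before "Net saved" writes NaN weights to disk.
        print("Non-finite cost at minibatch %d; aborting." % i)
        break
    print("minibatch %d: cost %.5f" % (i, cost))
```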

Thank you again!

anavc94 commented 5 years ago

Another question related to this: what can we consider good values of perplexity? I've noticed that my validation PPL stops decreasing after 2 epochs, with a best value of 1.28. However, as you can see in the output above, the training PPL keeps decreasing and reaches 1.00. Am I overfitting? My dataset is quite large, about 29M sentences, and I am using a batch size of 256 as you suggested, so it seems to me that training for only 2 epochs shouldn't be enough. Any advice?

Thanks again, I really appreciate it

ottokart commented 5 years ago

Perfect perplexity is 1.0, so you are definitely overfitting :) 1.28 on the validation set is quite good. Normally the script should stop once the validation PPL stops decreasing. Did you modify the source code (the `elif best_ppl not in validation_ppl_history[-PATIENCE_EPOCHS:]:` part, maybe)? My experience with this model has shown that 2-5 epochs are usually enough. 2 epochs means that the model has seen all 29M sentences twice.
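For reference, the stopping logic around that line works roughly as follows; this is a simplified, self-contained sketch (the surrounding code in main.py differs in its details, and the PPL values here are made up for illustration):

```python
PATIENCE_EPOCHS = 1

# Made-up per-epoch validation perplexities, just to show when the elif fires.
fake_validation_ppls = [1.45, 1.28, 1.30, 1.31]

best_ppl = float("inf")
validation_ppl_history = []

for epoch, ppl in enumerate(fake_validation_ppls):
    validation_ppl_history.append(ppl)
    if ppl <= best_ppl:
        best_ppl = ppl
        print("Epoch %d: new best PPL %.2f -> net saved" % (epoch, ppl))
    elif best_ppl not in validation_ppl_history[-PATIENCE_EPOCHS:]:
        # The best PPL is no longer among the last PATIENCE_EPOCHS values,
        # i.e. no improvement for PATIENCE_EPOCHS epochs -> stop training.
        print("Epoch %d: no recent improvement, stopping" % epoch)
        break
```

With a larger `PATIENCE_EPOCHS`, the `elif` fires correspondingly later, so training runs through more non-improving epochs before it stops.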

anavc94 commented 5 years ago

Hello @ottokart

thanks for the quick response! I appreciate it. I finally got a validation perplexity of 1.13 after training for just two epochs, and the results I am getting are quite good.

Answering your question: I modified main.py a bit to allow more training epochs and more patience epochs, and I save the model after each epoch so I can keep the one with the best validation perplexity (a rough sketch of that selection is below). I prefer picking the model that way, just a matter of taste. The thing is, I thought I would need more than two epochs to reach the optimal model.
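The selection step itself is simple; here is a rough sketch (the per-epoch checkpoint files and their names are my own convention, not something punctuator2 produces, and the PPL values other than the 1.13 are made up):

```python
# One checkpoint per epoch, keyed by the validation PPL recorded for it.
checkpoints = {
    "model_epoch01.pcl": 1.45,
    "model_epoch02.pcl": 1.13,
    "model_epoch03.pcl": 1.19,
}

# Keep the checkpoint with the lowest validation perplexity.
best = min(checkpoints, key=checkpoints.get)
print("Keeping %s (validation PPL %.2f)" % (best, checkpoints[best]))
```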

Thanks again; I'll close this issue! Ana