nan perplexity during training process

frankxu2004 commented 6 years ago

Following the instructions in the README, I started training the model with the given command. However, it is currently reporting a perplexity of NaN. Is this normal?

Epoch 10 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 2135 ; PPL nan ;
Epoch 10 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 2122 ; PPL nan ;
Epoch 10 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 2127 ; PPL nan ;
Epoch 10 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 2130 ; PPL nan ;
Validation perplexity: nan

Epoch 11 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1316 ; PPL nan ;
Epoch 11 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1189 ; PPL nan ;
Epoch 11 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1148 ; PPL nan ;
Epoch 11 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1132 ; PPL nan ;
Validation perplexity: nan

Epoch 12 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1088 ; PPL nan ;
Epoch 12 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1083 ; PPL nan ;
Epoch 12 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1090 ; PPL nan ;
Epoch 12 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1089 ; PPL nan ;
Validation perplexity: nan

Epoch 13 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1089 ; PPL nan ;
Epoch 13 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1086 ; PPL nan ;
Epoch 13 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1086 ; PPL nan ;
Epoch 13 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1086 ; PPL nan ;
Validation perplexity: nan

Epoch 14 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1070 ; PPL nan ;
Epoch 14 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1084 ; PPL nan ;
Epoch 14 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1087 ; PPL nan ;
Epoch 14 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1089 ; PPL nan ;
Validation perplexity: nan

Epoch 15 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1087 ; PPL nan ;
Epoch 15 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1076 ; PPL nan ;
Epoch 15 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1078 ; PPL nan ;
Epoch 15 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1080 ; PPL nan ;
Validation perplexity: nan

Epoch 16 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1082 ; PPL nan ;
Epoch 16 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1083 ; PPL nan ;
Epoch 16 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1077 ; PPL nan ;
Epoch 16 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1078 ; PPL nan ;
Validation perplexity: nan

Epoch 17 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1086 ; PPL nan ;
Epoch 17 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1083 ; PPL nan ;
Epoch 17 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1078 ; PPL nan ;

swiseman commented 6 years ago

Hmm, this can happen if the learning rate is too high, but is obviously undesirable. I assume before Epoch 10 you were getting reasonable PPLs? Also, can you tell me what sort of GPU you're training on? Practically, I'd recommend decreasing the learning rate or perhaps changing the seed.
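For intuition on why a learning rate that is too high can end in NaN, here is a toy illustration in Python. It is not code from this repository: the function names and the quadratic loss are made up for the example, and the real model (with gradient clipping, dropout, and so on) behaves far less cleanly. The sketch only shows the general mechanism: an oversized step makes the parameter grow until it overflows to infinity, `inf - inf` then yields NaN, and since perplexity is the exponential of the average negative log-likelihood, every later value stays NaN.

```python
import math


def sgd_on_quadratic(learning_rate, steps, w=1.0):
    """Run plain SGD on the toy loss f(w) = w**2 and return the final w.

    Hypothetical helper for illustration only.
    """
    for _ in range(steps):
        grad = 2.0 * w                 # derivative of w**2
        w = w - learning_rate * grad   # each step multiplies w by (1 - 2 * learning_rate)
    return w


def perplexity(total_nll, num_tokens):
    """Perplexity as exp(average negative log-likelihood per token)."""
    avg_nll = total_nll / num_tokens
    if not math.isfinite(avg_nll):
        return avg_nll                 # NaN or inf passes straight through
    try:
        return math.exp(avg_nll)
    except OverflowError:
        return math.inf                # finite loss, but too large to exponentiate


if __name__ == "__main__":
    stable = sgd_on_quadratic(learning_rate=0.1, steps=200)
    diverged = sgd_on_quadratic(learning_rate=1.5, steps=2000)
    print("lr=0.1 final w:", stable)
    print("lr=1.5 final w:", diverged)
    # Treat the toy loss as a stand-in for a summed NLL over 100 tokens.
    print("PPL from diverged loss:", perplexity(diverged * diverged, 100))
```

Once a NaN appears in the parameters it never goes away on its own, which is consistent with a log that reports `PPL nan` for every subsequent iteration and epoch.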

frankxu2004 commented 6 years ago

The GPU is an NVIDIA Titan X, and the training parameters are as follows:

CUDA_VISIBLE_DEVICES=1 th box_train.lua -data roto-train.t7 -save_model roto_jc_rec_tvd -rnn_size 600 -word_vec_size 600 -enc_emb_size 600 -ma$ _batch_size 16 -dropout 0.5 -feat_merge concat -pool mean -enc_layers 1 -enc_relu -report_every 50 -gpuid 1 -epochs 50 -learning_rate 1 -enc_dropout 0 -decay_update2 -layers 2 -copy_generate -t$ nh_query -max_bptt 100 -discrec -rho 1 -partition_feats -recembsize 600 -discdist 1 -seed 0

{
  input_feed : 1
  max_bptt : 100
  nrecpreds : 3
  switch : false
  pool : "mean"
  data : "roto-train.t7"
  pre_word_vecs_dec : ""
  just_lm : false
  map : false
  report_every : 50
  recembsize : 600
  word_vec_size : 600
  param_init : 0.1
  decay_update2 : true
  curriculum : 0
  save_model : "roto_jc_rec_tvd"
  enc_layers : 1
  just_eval : false
  copy_generate : true
  enc_emb_size : 600
  rnn_size : 600
  gen_file : "preds.txt"
  just_gen : false
  test : false
  beam_size : 5
  dropout : 0.5
  layers : 2
  max_batch_size : 16
  start_epoch : 1
  discdist : 1
  feat_merge : "concat"
  seed : 0
  optim : "sgd"
  train_from : ""
  gpuid : 1
  learning_rate : 1
  rho : 1
  start_iteration : 1
  learning_rate_decay : 0.5
  config : ""
  residual : false
  enc_relu : true
  mom : 0.9
  discrec : true
  epochs : 50
  max_grad_norm : 5
  fix_word_vecs_enc : false
  tanh_query : true
  continue : false
  nfilters : 200
  multilabel : false
  start_decay_at : 10000
  enc_dropout : 0
  json_log : false
  nparallel : 1
  save_every : 0
  pre_word_vecs_enc : ""
  disable_mem_optimization : false
  fix_word_vecs_dec : false
  partition_feats : true
  recdist : 0
}
Loading data from 'roto-train.t7'...
USING HACKY GLOBALS!!!
regRows: 13; specPadding: 22; nCols: 22; nFeats: 4

tripV: { 1 : 704 2 : 40 3 : 1836 }

vocabulary size: source = 10864; target = 10864
additional features: source = 0; target = 0
maximum sequence length: source = 23; target = 763
nSourceRows 28
number of training instances: 3398
maximum batch size: 16
Building model...
using input feeding
Initializing parameters...
setting forget gate bias to 2
setting forget gate bias to 2
number of parameters: 45495644

Actually, the perplexity already becomes NaN during the first epoch:

Epoch 1 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1802 ; PPL 9345.68 ;
Epoch 1 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1869 ; PPL nan ;
Epoch 1 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1896 ; PPL nan ;
Epoch 1 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1914 ; PPL nan ;
Validation perplexity: nan

Epoch 2 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1949 ; PPL nan ;
Epoch 2 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1949 ; PPL nan ;
Epoch 2 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1949 ; PPL nan ;
Epoch 2 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1952 ; PPL nan ;
Validation perplexity: nan

Hope this helps.
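To pin down exactly where a run first goes bad, the training output can be scanned for the first `PPL nan` entry. The Python sketch below assumes the log format shown in this thread (`Epoch E ; Iter I/N ; LR ... ; Target tokens/s ... ; PPL ...`); the function name and regular expression are made up for illustration and are not part of the repository.

```python
import math
import re

# Matches one progress entry in the format printed in this thread.
ENTRY = re.compile(
    r"Epoch (\d+) ; Iter (\d+)/\d+ ; LR \S+ ; Target tokens/s \d+ ; PPL (\S+)"
)


def first_nan_entry(log_text):
    """Return (epoch, iteration) of the first entry whose PPL is NaN, else None.

    Hypothetical helper; works whether entries are one per line or run together.
    """
    for match in ENTRY.finditer(log_text):
        epoch = int(match.group(1))
        iteration = int(match.group(2))
        ppl = float(match.group(3))    # float("nan") parses to NaN
        if math.isnan(ppl):
            return epoch, iteration
    return None


if __name__ == "__main__":
    sample = (
        "Epoch 1 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1802 ; PPL 9345.68 ;\n"
        "Epoch 1 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1869 ; PPL nan ;\n"
    )
    print(first_nan_entry(sample))
```

Knowing the first bad iteration narrows the window to inspect, for example by rerunning with a smaller `-report_every` value around that point.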

frankxu2004 commented 6 years ago

Well, after a rerun, this issue seems to be gone... Not sure exactly what happened.

Epoch 1 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1265 ; PPL 8226.52 ;
Epoch 1 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1296 ; PPL 3359.02 ;
Epoch 1 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1307 ; PPL 2249.66 ;
Epoch 1 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1313 ; PPL 1623.21 ;
Validation perplexity: 493.00677811967
Saving checkpoint to 'roto_jc_rec_tvd_epoch1_493.01.t7'...

Epoch 2 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1326 ; PPL 311.73 ;
Epoch 2 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1322 ; PPL 197.73 ;
Epoch 2 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1320 ; PPL 149.43 ;
Epoch 2 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1322 ; PPL 118.44 ;
Validation perplexity: 39.002056544585
Saving checkpoint to 'roto_jc_rec_tvd_epoch2_39.00.t7'...

Epoch 3 ; Iter 50/213 ; LR 1.0000 ; Target tokens/s 1325 ; PPL 37.82 ;
Epoch 3 ; Iter 100/213 ; LR 1.0000 ; Target tokens/s 1325 ; PPL 35.52 ;
Epoch 3 ; Iter 150/213 ; LR 1.0000 ; Target tokens/s 1320 ; PPL 33.54 ;
Epoch 3 ; Iter 200/213 ; LR 1.0000 ; Target tokens/s 1320 ; PPL 31.60 ;

harvardnlp/data2text, issue #9