harvardnlp / seq2seq-attn

Sequence-to-sequence model with LSTM encoder/decoders and attention
http://nlp.seas.harvard.edu/code
MIT License

Train from logical error #66

Open hanskrupakar opened 7 years ago

hanskrupakar commented 7 years ago

I am trying to resume training from a checkpoint file. Even though the script reports that it loaded the model, the perplexity restarts at the weight-initialization level, and the translation accuracy I get with evaluate.lua also suggests that the model is simply reinitializing its weights instead of loading them from the checkpoint.

Is this an issue with the API? What am I doing wrong?

.......
Epoch: 4, Batch: 11850/11961, Batch size: 16, LR: 0.1000, PPL: 2565.87, |Param|: 5479.77, |GParam|: 44.02, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11900/11961, Batch size: 16, LR: 0.1000, PPL: 2573.56, |Param|: 5480.11, |GParam|: 46.07, Training: 134/65/69 total/source/target tokens/sec   
Epoch: 4, Batch: 11950/11961, Batch size: 16, LR: 0.1000, PPL: 2580.50, |Param|: 5480.42, |GParam|: 90.12, Training: 134/65/69 total/source/target tokens/sec   
Train   2582.1220978721 
Valid   2958.3082902242 
saving checkpoint to demo-model_epoch4.00_2958.31.t7    
Script started on Monday 24 October 2016 08:55:52 AM IST
hans@hans-Lenovo-IdeaPad-Y500:~/seq2seq-attn-master$ th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model
using CUDA on GPU 1...
loading data...
done!
Source vocab size: 50004, Target vocab size: 150004
Source max sent len: 50, Target max sent len: 52
Number of additional features on source side: 0
Switching on memory preallocation
loading demo-model_epoch4.00_2958.31.t7...
Number of parameters: 84236504 (active: 84236504)
Epoch: 5, Batch: 50/11961, Batch size: 16, LR: 0.0500, PPL: 375825299.43, |Param|: 5407.84, |GParam|: 503.37, Training: 131/61/69 total/source/target tokens/sec
Epoch: 5, Batch: 100/11961, Batch size: 16, LR: 0.0500, PPL: 145308733.29, |Param|: 5407.19, |GParam|: 130.81, Training: 132/63/69 total/source/target tokens/sec
Epoch: 5, Batch: 150/11961, Batch size: 16, LR: 0.0500, PPL: 85249666.69, |Param|: 5406.86, |GParam|: 1190.36, Training: 133/64/69 total/source/target tokens/sec
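
In case it helps, the checkpoint can also be inspected directly from a Torch prompt to confirm it really contains trained parameters. The snippet below is just a quick sanity check; the exact layout of the serialized .t7 file is my assumption, not something I verified against the code:

    -- Load the saved checkpoint and print its contents (model, options, ...).
    -- If the parameters here are the trained ones, the problem is on the
    -- resume/loading side rather than in the saved file.
    require 'torch'
    local checkpoint = torch.load('demo-model_epoch4.00_2958.31.t7')
    print(checkpoint)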
guillaumekln commented 7 years ago

I can't reproduce this on the latest revision.

hanskrupakar commented 7 years ago

I didn't make any changes except specifying the checkpoint to resume from. I have attached a log file showing the training and resume commands, which are identical apart from specifying the file to load from.

log.txt

guillaumekln commented 7 years ago

Something is not right. According to your log file, you always run the same command:

th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model

Is that the case?
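
For reference, resuming would normally be requested explicitly on the command line, e.g. via the -train_from option; the checkpoint name below is simply the one from your log, so treat this as a sketch rather than your exact invocation:

    th train.lua -data_file data/demo-train.hdf5 -val_data_file data/demo-val.hdf5 -savefile demo-model -train_from demo-model_epoch4.00_2958.31.t7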

hanskrupakar commented 7 years ago

I ran it again from the beginning after you said it was strange. Attached is the log file for that run, along with the train.lua and preprocess.py I used. preprocess.py.docx train.lua.docx error.txt

guillaumekln commented 7 years ago

It seems that AdaGrad does not play nicely with the train_from option at the moment. I would advise you to stick with the default SGD, which works well.

Also, please don't set your options within the code. It is error prone and makes it harder for whoever is assisting you to understand what you are doing.
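
To illustrate the likely failure mode (a generic AdaGrad sketch, not this repository's actual optimizer code): AdaGrad keeps a per-parameter accumulator of squared gradients, and if that accumulator is not serialized and restored together with the model, a run resumed with -train_from starts with an empty history and takes far too large steps at first, which would explain the exploded perplexity.

    -- Generic AdaGrad step, for illustration only.
    -- state.g2 is the running sum of squared gradients; it lives outside the
    -- model parameters, so unless it is saved in the checkpoint it is reset
    -- to zero when training resumes.
    local function adagradStep(param, grad, state, lr)
      state.g2 = state.g2 or grad:clone():zero()
      state.g2:addcmul(1, grad, grad)                      -- accumulate grad^2
      param:addcdiv(-lr, grad, torch.sqrt(state.g2):add(1e-10))
    end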

hanskrupakar commented 7 years ago

I will remember not to make inline changes from now on. I switched to SGD and train_from now works as expected. Thanks.