lvapeab / nmt-keras

Neural Machine Translation with Keras
http://nmt-keras.readthedocs.io
MIT License

Unable to resume training using RELOAD_EPOCH = False #71

Closed. qazs closed this issue 6 years ago.

qazs commented 6 years ago

Hi, I was trying to resume training of my trained model (epoch 10) and got the error below. It seems to be looking for an update_N_weights file. How do I create this update file? I used the default config settings when training for the first time.

Error:

OSError: Unable to open file (unable to open file: name = 'trained_models/CnTrans_encn_AttentionRNNEncoderDecoder_src_emb_32_bidir_True_enc_LSTM_32_dec_ConditionalLSTM_32_deepout_linear_trg_emb_32_Adam_0.001//update_10_weights.h5', errno = 2, error message = 'No such file or directory', flags = 0, o_flags = 0)

My resumed training config:

RELOAD = 10 
RELOAD_EPOCH = False 
REBUILD_DATASET = False 

If I set RELOAD_EPOCH = True it works, but then I have to increment my RELOAD value every time (in this case, setting it to a value > 10).

lvapeab commented 6 years ago

Hi, I think there's some confusion. You can save/load your models either after a given number of epochs or after a given number of updates.

If you do the former, your saved models will have the suffix _epoch_N; if you do the latter, the suffix update_N. You need to load the models according to how you saved them.
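A minimal sketch of the naming rule just described, assuming the reload prefix is simply <model_dir>/epoch_N or <model_dir>/update_N. That pattern is inferred from the file names visible in this thread (epoch_1.h5, update_10_weights.h5), not taken from the nmt-keras source, and the helper name checkpoint_prefix is hypothetical.

```python
import os


def checkpoint_prefix(model_dir: str, reload: int, reload_epoch: bool) -> str:
    """Hypothetical helper: the path prefix a reload would look for.

    The epoch_N / update_N pattern is inferred from the file names in this
    thread (epoch_1.h5, update_10_weights.h5); it is not copied from nmt-keras.
    """
    if reload < 1:
        raise ValueError("RELOAD must be >= 1 to reload a checkpoint")
    kind = "epoch" if reload_epoch else "update"
    return os.path.join(model_dir, f"{kind}_{reload}")


model_dir = "trained_models/my_model"  # hypothetical directory name
print(checkpoint_prefix(model_dir, 10, reload_epoch=True))   # .../epoch_10
print(checkpoint_prefix(model_dir, 10, reload_epoch=False))  # .../update_10
```

With RELOAD = 10 and RELOAD_EPOCH = False the sketch yields the update_10 prefix, which matches the failing path in the error above, even though only epoch-based checkpoints had been written.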

qazs commented 6 years ago

Thanks, but how do you load the model? I'm running python main.py, as shown in the documentation.

lvapeab commented 6 years ago

As you almost did:

RELOAD = 10 
RELOAD_EPOCH = True 
REBUILD_DATASET = False

qazs commented 6 years ago

I'm using this config:

RELOAD = 10 
RELOAD_EPOCH = True 
REBUILD_DATASET = False 

but I didn't get any update_N_weights.h5 file. Isn't it supposed to generate an update_N_weights.h5 file?

lvapeab commented 6 years ago

Can you please show me the output of ls trained_models/CnTrans_encn_AttentionRNNEncoderDecoder_src_emb_32_bidir_True_enc_LSTM_32_dec_ConditionalLSTM_32_deepout_linear_trg_emb_32_Adam_0.001/*?

qazs commented 6 years ago

Here you go. I ran 1 epoch for testing:

Config:

RELOAD = 0
RELOAD_EPOCH = True/False
REBUILD_DATASET = False/True

Directory contents:

config.pkl
epoch_1.h5
epoch_1_Model_Wrapper.pkl
epoch_1_structure_init.json
epoch_1_structure_next.json
epoch_1_weights_init.h5
epoch_1_weights_next.h5
tensorboard_logs/

If I then set the config as below and run again, I get the error because of the missing update_N_weights.h5 file.

RELOAD = 1
RELOAD_EPOCH = False
REBUILD_DATASET = False

lvapeab commented 6 years ago

If you want to load the model from epoch 1, you should switch the RELOAD_EPOCH option to True.
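To avoid bumping RELOAD by hand before every resume, the newest epoch checkpoint can be read off the model directory. This is a sketch under the assumption (based only on the listing above) that each finished epoch N leaves a full-model file named epoch_N.h5; the helpers latest_epoch and reload_settings are hypothetical and not part of nmt-keras.

```python
import os
import re

# Matches full-model checkpoints such as "epoch_1.h5", but not
# "epoch_1_weights_init.h5" or "epoch_1_Model_Wrapper.pkl".
EPOCH_CKPT = re.compile(r"epoch_(\d+)\.h5")


def latest_epoch(model_dir: str) -> int:
    """Hypothetical helper: highest N with an epoch_N.h5 file, or 0 if none."""
    if not os.path.isdir(model_dir):
        return 0
    epochs = [
        int(m.group(1))
        for name in os.listdir(model_dir)
        if (m := EPOCH_CKPT.fullmatch(name))
    ]
    return max(epochs, default=0)


def reload_settings(model_dir: str) -> dict:
    """Suggest RELOAD / RELOAD_EPOCH values for resuming from the newest epoch."""
    n = latest_epoch(model_dir)
    # RELOAD = 0 means "train from scratch"; RELOAD_EPOCH is then irrelevant.
    return {"RELOAD": n, "RELOAD_EPOCH": True}
```

For the one-epoch run shown above this would suggest RELOAD = 1 with RELOAD_EPOCH = True, which is the setting recommended here. The values still have to be copied into the config by hand; the sketch does not edit it.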

VP007-py commented 5 years ago

@lvapeab to reload the model from the Nth epoch, should one set RELOAD = N?

lvapeab commented 5 years ago

Yes, and RELOAD_EPOCH = True.

VP007-py commented 5 years ago

If one is training from scratch, should it be RELOAD = 0 with RELOAD_EPOCH = True or False?

lvapeab commented 5 years ago

If RELOAD=0, the RELOAD_EPOCH option doesn't matter.
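The rules stated in this thread (RELOAD = 0 trains from scratch and ignores RELOAD_EPOCH; RELOAD = N reloads epoch_N or update_N depending on RELOAD_EPOCH) can be turned into a pre-flight check that fails before training starts, rather than with the OSError shown at the top of the issue. This is a hedged sketch: the function reload_target is hypothetical, the config is modeled as a plain dict, and the file-name prefixes are inferred from the directory listing above.

```python
import os
from typing import Optional


def reload_target(config: dict, model_dir: str) -> Optional[str]:
    """Hypothetical pre-flight check following the rules stated in this thread.

    RELOAD == 0 -> train from scratch (returns None; RELOAD_EPOCH is ignored).
    RELOAD == N -> epoch_N if RELOAD_EPOCH is True, update_N otherwise.
    Raises FileNotFoundError if no file with that prefix exists, instead of
    failing later with an OSError from the weights loader.
    """
    n = int(config.get("RELOAD", 0))
    if n == 0:
        return None
    prefix = ("epoch" if config["RELOAD_EPOCH"] else "update") + f"_{n}"
    names = os.listdir(model_dir) if os.path.isdir(model_dir) else []
    # Requiring "." or "_" after the prefix keeps N = 1 from matching "epoch_10.h5".
    if not any(name.startswith((prefix + ".", prefix + "_")) for name in names):
        raise FileNotFoundError(
            f"no {prefix}* checkpoint in {model_dir}; "
            "check RELOAD and RELOAD_EPOCH against the saved files"
        )
    return os.path.join(model_dir, prefix)
```

Run against the one-epoch directory listed above, RELOAD = 1 with RELOAD_EPOCH = False would raise immediately, while RELOAD_EPOCH = True would return the epoch_1 prefix.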