tensorflow / nmt

TensorFlow Neural Machine Translation Tutorial
Apache License 2.0

How to resume training (finetuning) on the checkpoint(saved) model? #51

Open fansiawang opened 7 years ago

fansiawang commented 7 years ago

For example, I have trained a model for 300,000 steps and saved it successfully. What if I want to continue training from the saved model, say, for 300,000 more steps? I cannot find any documentation on this. Does anybody know the command details?

ShenYounger commented 7 years ago

Yeah, I have the same question as you. @oahziur, could you help us solve this problem?

oahziur commented 7 years ago

The model saves everything to out_dir. As long as out_dir is not removed, re-running the same training command will continue training from the latest saved checkpoint.
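For example (a sketch only; the paths and hparams file mirror the ones used later in this thread, so adjust them to your own setup), the exact same command that started training also resumes it:

# The first run and every resumed run use the identical command; as long
# as ./nmt/nmt_model keeps its checkpoint files and its hparams file,
# training continues from the latest saved global step.
python -m nmt.nmt \
  --hparams_path=./nmt/standard_hparams/iwslt15.json \
  --src=vi --tgt=en \
  --vocab_prefix=./nmt/nmt_data/iwslt15/vocab \
  --train_prefix=./nmt/nmt_data/iwslt15/train \
  --dev_prefix=./nmt/nmt_data/iwslt15/tst2012 \
  --out_dir=./nmt/nmt_model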

fansiawang commented 7 years ago

@oahziur You are absolutely right! I resumed training successfully! Now I have another question: I am trying to fine-tune on a new dataset starting from the saved checkpoint. Following the method above for continuing training, I copied four files (checkpoint, translate.ckpt-11000.data-00000-of-00001, translate.ckpt-11000.index, translate.ckpt-11000.meta) into a new out_dir, and changed only the training data path and the hparams file. I executed the following command:

python -m nmt.nmt \
  --hparams_path=./nmt/standard_hparams/tst2012.json \
  --src=vi --tgt=en \
  --vocab_prefix=./nmt/nmt_data/iwslt15/vocab \
  --train_prefix=./nmt/nmt_data/iwslt15/tst2012 \
  --dev_prefix=./nmt/nmt_data/iwslt15/tst2013 \
  --out_dir=./nmt/nmt_model

But it only evaluates the dev data; it does not start training from the saved checkpoint. The log is as follows:

loaded infer model parameters from ./nmt/nmt_model/translate.ckpt-11000, time 0.12s
# External evaluation, global step 11000
  decoding to output ./nmt/nmt_model/output_dev.

and the learning rate is also a little strange:

saving hparams to ./nmt/nmt_model/hparams
# Final, step 11000 lr 0.000976562 step-time 0.00 wps 0.00K ppl 0.00, dev ppl 8.51, dev bleu 24.7, Fri Aug  4 12:13:45 2017
# Done training!, time 128s, Fri Aug  4 12:13:45 2017.
# Start evaluating saved best models.

The previous hparams file was iwslt15.json and the new one is tst2012.json; they differ only in num_train_steps, start_decay_step and decay_steps. But GNMT does not use tst2012.json; it just evaluates the dev data. So I am confused about the correct way to fine-tune. Do I need to modify other files to fine-tune correctly? Could you help me solve this problem? Thank you very much!

oahziur commented 7 years ago

@fansiawang Do you have the ./nmt/nmt_model/hparams file before you start the training? You can check whether the parameters in ./nmt/nmt_model/hparams match your tst2012.json.

Here is the code showing how we load hparams in the model.
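For concreteness, here is a small helper script (not part of the repo; the file name is made up) that does that comparison. The hparams file saved in out_dir is plain JSON, so the two files can be diffed key by key:

# diff_hparams.py -- compare the hparams saved in out_dir with a
# standard_hparams JSON file and print every key whose value differs
# or that exists in only one of the two files.
import json
import sys

def load(path):
    with open(path) as f:
        return json.load(f)

def diff(saved_path, new_path):
    saved, new = load(saved_path), load(new_path)
    for key in sorted(set(saved) | set(new)):
        a = saved.get(key, "<missing>")
        b = new.get(key, "<missing>")
        if a != b:
            print("%-28s saved=%s  new=%s" % (key, a, b))

if __name__ == "__main__":
    # e.g. python diff_hparams.py ./nmt/nmt_model/hparams ./nmt/standard_hparams/tst2012.json
    diff(sys.argv[1], sys.argv[2])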

fansiawang commented 7 years ago

@oahziur I do have the ./nmt/nmt_model/hparams file before I start the training, and I tried to match the parameters in ./nmt/nmt_model/hparams with tst2012.json, but they differ. The ./nmt/nmt_model/hparams file was copied from the previously trained model, which is the one I want to fine-tune. For example, the parameter best_bleu exists in ./nmt/nmt_model/hparams but not in tst2012.json. The ./nmt/nmt_model/hparams file is as follows:

{"pass_hidden_state": true, "steps_per_stats": 100, "tgt": "en", "out_dir": "./nmt/nmt_model", "source_reverse": false, "sos": "<s>", "encoder_type": "bi", "best_bleu": 21.98009987821807, "tgt_vocab_size": 17191, "num_layers": 2, "optimizer": "sgd", "init_weight": 0.1, "tgt_vocab_file": "./nmt/nmt_data/iwslt15/vocab.en", "src_max_len_infer": null, "beam_width": 10, "src_vocab_size": 7709, "decay_factor": 0.5, "src_max_len": 50, "vocab_prefix": "./nmt/nmt_data/iwslt15/vocab", "share_vocab": false, "test_prefix": null, "attention_architecture": "standard", "bpe_delimiter": null, "epoch_step": 527, "infer_batch_size": 32, "src_vocab_file": "./nmt/nmt_data/iwslt15/vocab.vi", "colocate_gradients_with_ops": true, "learning_rate": 1.0, "start_decay_step": 1000, "unit_type": "lstm", "num_train_steps": 5000, "time_major": true, "dropout": 0.2, "attention": "scaled_luong", "tgt_max_len": 50, "batch_size": 128, "residual": false, "metrics": ["bleu"], "length_penalty_weight": 0.0, "train_prefix": "./nmt/nmt_data/iwslt15/train", "forget_bias": 1.0, "max_gradient_norm": 5.0, "num_residual_layers": 0, "log_device_placement": false, "random_seed": null, "src": "vi", "num_gpus": 1, "dev_prefix": "./nmt/nmt_data/iwslt15/tst2012", "max_train": 0, "steps_per_external_eval": null, "eos": "</s>", "decay_steps": 1000, "tgt_max_len_infer": null, "num_units": 512, "num_buckets": 5, "best_bleu_dir": "./nmt/nmt_attention_model/iwslt15_new/best_bleu"}

The tst2012.json is:

{
  "attention": "scaled_luong",
  "attention_architecture": "standard",
  "batch_size": 128,
  "bpe_delimiter": null,
  "colocate_gradients_with_ops": true,
  "decay_factor": 0.5,
  "decay_steps": 1000,
  "dropout": 0.2,
  "encoder_type": "bi",
  "eos": "</s>",
  "forget_bias": 1.0,
  "infer_batch_size": 32,
  "init_weight": 0.1,
  "learning_rate": 1.0,
  "max_gradient_norm": 5.0,
  "metrics": ["bleu"],
  "num_buckets": 5,
  "num_layers": 2,
  "num_train_steps": 5000,
  "num_units": 512,
  "optimizer": "sgd",
  "residual": false,
  "share_vocab": false,
  "sos": "<s>",
  "source_reverse": false,
  "src_max_len": 50,
  "src_max_len_infer": null,
  "start_decay_step": 1000,
  "steps_per_external_eval": null,
  "steps_per_stats": 100,
  "tgt_max_len": 50,
  "tgt_max_len_infer": null,
  "time_major": true,
  "unit_type": "lstm",
  "beam_width": 10
}

Because of the differences between ./nmt/nmt_model/hparams and tst2012.json, I am confused about how to make them match. Even though I put the checkpoint files and the hparams file in my out_dir, it only evaluates and does not fine-tune. Could you tell me how to modify my ./nmt/nmt_model/hparams to match tst2012.json? Thank you very much!

fansiawang commented 7 years ago

@oahziur Excuse me, I have another question. If I want to change the learning-rate schedule during training, I change the ./nmt/nmt_model/hparams file in the model directory as well as the JSON file. For example, previously learning_rate=0.5 and start_decay_step=5000, and the latest checkpoint is at step 3500. Now I want start_decay_step=3500, so I change the hparams file and the JSON file and re-run the same training command. But it still starts decaying the learning rate at step 5000, not 3500. Where did I go wrong?

oahziur commented 7 years ago

@fansiawang Try adding your fine-tuned keys here locally. By default we only allow a fixed set of hparams to be updated, for compatibility reasons.

fansiawang commented 7 years ago

@oahziur It seems that GNMT cannot fine-tune an existing model. If I want to change the learning rate or other parameters of the learning schedule, I need to re-train a new model. If I want to pre-train a model on a big dataset and then fine-tune it on a smaller dataset, how do I achieve that? I mean using the pre-trained model to initialize the parameters before training the new model.

oahziur commented 7 years ago

@fansiawang The use case should be possible with a small modification of the code.

For example, if you want to update the training source and learning rate, add ["learning_rate", "train_prefix"] to the updated_keys in nmt/nmt.py.

You should see logs like this when re-training with the updated hyperparameters:

# Updating hparams.num_train_steps: 12000 -> 24000
# Updating hparams.learning_rate: 1.0 -> 0.5
# Updating hparams.train_prefix: /tmp/nmt_data/train -> /tmp/nmt_data/tst2013

You need to increase num_train_steps so it is greater than the pre-trained model's global step.
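For readers following along, here is a minimal, self-contained sketch of that whitelist behaviour (not the repo's exact code; the values simply mirror the example log above): only keys listed in updated_keys are copied from the newly supplied hparams onto the hparams loaded from out_dir, so any hparam you want to change when fine-tuning must be added to that list.

# Illustrative only: mimics how a fixed whitelist of keys is refreshed
# from the new hparams while everything else keeps its saved value.
saved_hparams = {"learning_rate": 1.0, "train_prefix": "/tmp/nmt_data/train",
                 "num_train_steps": 12000, "start_decay_step": 5000}
new_hparams = {"learning_rate": 0.5, "train_prefix": "/tmp/nmt_data/tst2013",
               "num_train_steps": 24000, "start_decay_step": 3500}

# Extend this list with the keys you want to be able to fine-tune.
# "start_decay_step" is deliberately left out here, which is why an
# attempt to move the decay from step 5000 to 3500 would have no effect.
updated_keys = ["num_train_steps", "learning_rate", "train_prefix"]

for key in updated_keys:
    if key in new_hparams and saved_hparams.get(key) != new_hparams[key]:
        print("# Updating hparams.%s: %s -> %s"
              % (key, saved_hparams[key], new_hparams[key]))
        saved_hparams[key] = new_hparams[key]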

cyzLoveDream commented 7 years ago

Hello, I have successfully run the program, but the contents of train_log are garbled. How can I solve this problem? It looks like this: " ?K" Hg諥 brain.Event:2觼c辘 辝_q cf.Hg諥"萆F"

szm-R commented 6 years ago

Hi everyone, how can we resume training when the last saved checkpoint is corrupted? My computer restarted, and the last checkpoint now shows a size of 0 bytes. I deleted it, but the code still tries to resume training from that last, empty checkpoint. What should I do to make it switch to the one before last?

Keerthana-Manjunatha commented 6 years ago

I hope this helps https://machinelearningmastery.com/check-point-deep-learning-models-keras/

Trotts commented 5 years ago

> Hi everyone, how can we resume training when the last saved checkpoint is corrupted? My computer restarted, and the last checkpoint now shows a size of 0 bytes. I deleted it, but the code still tries to resume training from that last, empty checkpoint. What should I do to make it switch to the one before last?

@szm2015 did you find a fix for this? Having the same issue atm
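One generic TF 1.x workaround (not specific to this repo, and assuming the older checkpoint files are still in out_dir) is to rewrite the checkpoint manifest so it points at the previous checkpoint, for example:

import tensorflow as tf

out_dir = "./nmt/nmt_model"  # placeholder path; use your own out_dir

# The "checkpoint" file in out_dir records which checkpoint is "latest".
# After a crash it may still point at a corrupted or deleted checkpoint.
ckpt = tf.train.get_checkpoint_state(out_dir)
print(ckpt.model_checkpoint_path)              # the broken "latest" checkpoint
print(list(ckpt.all_model_checkpoint_paths))   # older checkpoints still listed

# Point the manifest at the previous checkpoint (assumes at least one
# older entry exists); training will then resume from it.
tf.train.update_checkpoint_state(
    out_dir,
    model_checkpoint_path=ckpt.all_model_checkpoint_paths[-2],
    all_model_checkpoint_paths=list(ckpt.all_model_checkpoint_paths[:-1]))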

sandeepdahake commented 5 years ago

Hello everyone, I have trained a seq2seq TensorFlow model for translating sentences from English to Spanish. I trained the model for 420,000 steps and saved the checkpoints successfully. My training set contains 150,000 sentence pairs (English and Spanish). I want to add new data to the training set and continue training from step 420,000. I am using a sequence-to-sequence TensorFlow model for this. How can I continue training from the last checkpoint? Here is the link I am following for the translation: https://github.com/MonicaVillanueva/English_Spanish_Translator
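In case it helps, the generic TF 1.x pattern for picking up training from the latest checkpoint looks roughly like this (a sketch only: it assumes you rebuild the same graph as the original training script, and the checkpoint directory is a placeholder):

import tensorflow as tf

ckpt_dir = "./model_checkpoints"  # placeholder: directory holding your saved checkpoints

# Rebuild exactly the same graph as the original training script; the
# global step stands in here for the full seq2seq model definition.
global_step = tf.train.get_or_create_global_step()
# ... model, loss, and optimizer would be defined here ...

saver = tf.train.Saver()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    latest = tf.train.latest_checkpoint(ckpt_dir)
    if latest:
        # Overwrite the freshly initialized variables with the saved ones,
        # including global_step, so training continues from step 420,000
        # instead of restarting at 0.
        saver.restore(sess, latest)
    # ... run the normal training loop from here, now over the enlarged dataset ...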