fansiawang opened this issue 7 years ago
Yeah, I have the same question as you. @oahziur, could you help us solve this problem?
The model will save everything to `out_dir`. As long as `out_dir` is not removed, re-running the same training command will continue training from the latest saved checkpoint.
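For illustration only, here is a minimal sketch (not the actual nmt code; the paths and variable names are hypothetical) of the TensorFlow 1.x pattern behind that behaviour: `tf.train.latest_checkpoint` finds the newest checkpoint in `out_dir`, and a `Saver` restores it before training continues.

```python
import tensorflow as tf  # TF 1.x, as used by the nmt tutorial

out_dir = "./nmt/nmt_model"  # hypothetical path; the real one comes from --out_dir
global_step = tf.Variable(0, trainable=False, name="global_step")
saver = tf.train.Saver()

with tf.Session() as sess:
    latest = tf.train.latest_checkpoint(out_dir)  # e.g. ".../translate.ckpt-11000"
    if latest:
        saver.restore(sess, latest)  # resume from the newest saved checkpoint
    else:
        sess.run(tf.global_variables_initializer())  # no checkpoint yet: start fresh
```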
@oahziur You are absolutely right! I resumed training successfully! Now I have another question.
I am trying to fine-tune on a new dataset starting from the saved checkpoint. Following the earlier method for continuing training, I copied four files (`checkpoint`, `translate.ckpt-11000.data-00000-of-00001`, `translate.ckpt-11000.index`, `translate.ckpt-11000.meta`) into a new `out_dir`, and I only changed the training data path and the hparams file. I executed the following command:
python -m nmt.nmt \
--hparams_path=./nmt/standard_hparams/tst2012.json \
--src=vi --tgt=en \
--vocab_prefix=./nmt/nmt_data/iwslt15/vocab \
--train_prefix=./nmt/nmt_data/iwslt15/tst2012 \
--dev_prefix=./nmt/nmt_data/iwslt15/tst2013 \
--out_dir=./nmt/nmt_model
But it only evaluates the dev data; it does not start training from the saved checkpoint. The log is as follows:
loaded infer model parameters from ./nmt/nmt_model/translate.ckpt-11000, time 0.12s
# External evaluation, global step 11000
decoding to output ./nmt/nmt_model/output_dev.
The learning rate is also a little strange:
saving hparams to ./nmt/nmt_model/hparams
# Final, step 11000 lr 0.000976562 step-time 0.00 wps 0.00K ppl 0.00, dev ppl 8.51, dev bleu 24.7, Fri Aug 4 12:13:45 2017
# Done training!, time 128s, Fri Aug 4 12:13:45 2017.
# Start evaluating saved best models.
The previous hparams file is `iwslt15.json`, and the new hparams file is `tst2012.json`. The only differences between `iwslt15.json` and `tst2012.json` are `num_train_steps`, `start_decay_step`, and `decay_steps`. But GNMT doesn't use `tst2012.json`; it just evaluates the dev data. So I'm confused about which is the correct way to fine-tune. Do I need to modify other files to fine-tune correctly? Could you help me solve this problem? Thank you very much!!!
@fansiawang Do you have the `./nmt/nmt_model/hparams` file before you start training? You can check whether the parameters in `./nmt/nmt_model/hparams` match your `tst2012.json`. Here is the code where we load hparams in the model.
@oahziur I do have the `./nmt/nmt_model/hparams` file before I start training, and I tried to match the parameters in `./nmt/nmt_model/hparams` with `tst2012.json`. But the parameters in `./nmt/nmt_model/hparams` are different from those in `tst2012.json`. The `./nmt/nmt_model/hparams` file is copied from the previously trained model, which is the one I want to fine-tune. For example, there is no parameter named `best_bleu` in `tst2012.json`, but it exists in `./nmt/nmt_model/hparams`. The `./nmt/nmt_model/hparams` is as follows:
{"pass_hidden_state": true, "steps_per_stats": 100, "tgt": "en", "out_dir": "./nmt/nmt_model", "source_reverse": false, "sos": "<s>", "encoder_type": "bi", "best_bleu": 21.98009987821807, "tgt_vocab_size": 17191, "num_layers": 2, "optimizer": "sgd", "init_weight": 0.1, "tgt_vocab_file": "./nmt/nmt_data/iwslt15/vocab.en", "src_max_len_infer": null, "beam_width": 10, "src_vocab_size": 7709, "decay_factor": 0.5, "src_max_len": 50, "vocab_prefix": "./nmt/nmt_data/iwslt15/vocab", "share_vocab": false, "test_prefix": null, "attention_architecture": "standard", "bpe_delimiter": null, "epoch_step": 527, "infer_batch_size": 32, "src_vocab_file": "./nmt/nmt_data/iwslt15/vocab.vi", "colocate_gradients_with_ops": true, "learning_rate": 1.0, "start_decay_step": 1000, "unit_type": "lstm", "num_train_steps": 5000, "time_major": true, "dropout": 0.2, "attention": "scaled_luong", "tgt_max_len": 50, "batch_size": 128, "residual": false, "metrics": ["bleu"], "length_penalty_weight": 0.0, "train_prefix": "./nmt/nmt_data/iwslt15/train", "forget_bias": 1.0, "max_gradient_norm": 5.0, "num_residual_layers": 0, "log_device_placement": false, "random_seed": null, "src": "vi", "num_gpus": 1, "dev_prefix": "./nmt/nmt_data/iwslt15/tst2012", "max_train": 0, "steps_per_external_eval": null, "eos": "</s>", "decay_steps": 1000, "tgt_max_len_infer": null, "num_units": 512, "num_buckets": 5, "best_bleu_dir": "./nmt/nmt_attention_model/iwslt15_new/best_bleu"}
The `tst2012.json` is:
{
"attention": "scaled_luong",
"attention_architecture": "standard",
"batch_size": 128,
"bpe_delimiter": null,
"colocate_gradients_with_ops": true,
"decay_factor": 0.5,
"decay_steps": 1000,
"dropout": 0.2,
"encoder_type": "bi",
"eos": "</s>",
"forget_bias": 1.0,
"infer_batch_size": 32,
"init_weight": 0.1,
"learning_rate": 1.0,
"max_gradient_norm": 5.0,
"metrics": ["bleu"],
"num_buckets": 5,
"num_layers": 2,
"num_train_steps": 5000,
"num_units": 512,
"optimizer": "sgd",
"residual": false,
"share_vocab": false,
"sos": "<s>",
"source_reverse": false,
"src_max_len": 50,
"src_max_len_infer": null,
"start_decay_step": 1000,
"steps_per_external_eval": null,
"steps_per_stats": 100,
"tgt_max_len": 50,
"tgt_max_len_infer": null,
"time_major": true,
"unit_type": "lstm",
"beam_width": 10
}
Because of the differences between `./nmt/nmt_model/hparams` and `tst2012.json`, I'm confused about how to match them. Even though I put the checkpoint files and the hparams file in my `out_dir`, it just evaluated instead of fine-tuning. Could you tell me how to modify my `./nmt/nmt_model/hparams` to match `tst2012.json`? Thank you very much!!
@oahziur Excuse me, I have another question. If I want to change the learning-rate schedule during training, I change the `./nmt/nmt_model/hparams` file in the model directory as well as the json file. For example, the previous run had learning_rate=0.5 and start_decay_step=5000, and the latest checkpoint is at step 3500. Now I want to set start_decay_step=3500, so I change the hparams file and the json file and re-run the same training command. But it still starts decaying the learning rate at step 5000, not 3500. Where did I go wrong?
@fansiawang Try adding your fine-tuned keys here locally. We only allow updating a fixed set of hparams by default, for compatibility reasons.
@oahziur It seems that GNMT cannot fine-tune an existing model. If I want to change the learning rate or other parameters of the training schedule, I need to re-train a new model. If I want to pre-train a model on a big dataset and then fine-tune it on another, smaller dataset, how do I achieve that? I mean using the pre-trained model to initialize the parameters before training a new model.
@fansiawang The use case should be possible with a small modification of the code. For example, if you want to update the training source and the learning rate, add `["learning_rate", "train_prefix"]` to the `updated_keys` in `nmt/nmt.py`. You should see logs like this when re-training with the updated hyperparameters:
# Updating hparams.num_train_steps: 12000 -> 24000
# Updating hparams.learning_rate: 1.0 -> 0.5
# Updating hparams.train_prefix: /tmp/nmt_data/train -> /tmp/nmt_data/tst2013
You need to increase `num_train_steps` so that it is greater than the pre-trained model's global step.
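As a hedged sketch of that edit (the exact default contents of `updated_keys` differ between versions of `nmt/nmt.py`, so the first entries below are only illustrative; the appended keys are the actual change):

```python
# In nmt/nmt.py, the hparams loaded from out_dir are reconciled with the new
# command-line/json values, and only a whitelist of keys may be overridden.
# Appending keys to that whitelist lets them be updated on a re-run.
updated_keys = [
    "out_dir", "num_train_steps",        # examples of keys that may already be allowed
    "learning_rate", "train_prefix",     # added, per the suggestion above
    "start_decay_step", "decay_steps",   # add these too to change the decay schedule
]
```

If `start_decay_step` and `decay_steps` are whitelisted as well, the earlier question about moving the decay from step 5000 to step 3500 should also work: change the value in the json file, re-run, and look for an `# Updating hparams.start_decay_step` line in the log.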
Hello, I have successfully run the program, but the display in train_log is messy. How can I solve this problem? It looks like this: " ?K" Hg諥 brain.Event:2觼c辘 辝_q cf.Hg諥"萆F"
Hi everyone, how can we resume training when the last saved checkpoint is corrupted? My computer restarted, and when I checked, the last checkpoint had a size of 0 bytes. I deleted it, but the code still tries to resume training from this last, empty checkpoint. What should I do to make it switch to the one before last?
@szm2015 did you find a fix for this? Having the same issue atm
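No fix was posted in this thread, but a common way to handle a corrupt newest checkpoint (a sketch, assuming TF 1.x and that the older checkpoint files are still intact) is to re-point the small `checkpoint` bookkeeping file in `out_dir` at an earlier checkpoint, either by editing it in a text editor or with `tf.train.update_checkpoint_state`:

```python
import tensorflow as tf  # TF 1.x

out_dir = "./nmt/nmt_model"  # hypothetical path
state = tf.train.get_checkpoint_state(out_dir)
print(state.all_model_checkpoint_paths)  # every checkpoint the trainer still knows about

# Assume the last entry is the corrupted one; promote the one before it.
good = state.all_model_checkpoint_paths[-2]
tf.train.update_checkpoint_state(
    out_dir, good,
    all_model_checkpoint_paths=list(state.all_model_checkpoint_paths[:-1]))
```

After this, `tf.train.latest_checkpoint(out_dir)` returns the promoted checkpoint and training should resume from it.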
Hello everyone, I have trained a seq2seq TensorFlow model for translating sentences from English to Spanish. I trained the model for 420,000 steps and saved the model checkpoints successfully. My training data is 150,000 sentence pairs (English and Spanish). I want to add new data to my training set and continue training from step 420,000. I am using a sequence-to-sequence TensorFlow model for this. How can I start new model training from the last checkpoint? Here is the link that I am following for the translation: https://github.com/MonicaVillanueva/English_Spanish_Translator
For example, I have trained a model for 300,000 steps and saved it successfully. What if I want to continue training on top of the saved model, say, for 300,000 more steps? I cannot find any documentation for this. Does anybody know the command details?
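Neither of the last two questions got an answer here, but the general TF 1.x pattern for both (continuing from step 420,000 with new data, or training 300,000 more steps) is the same: rebuild the same graph, restore the newest checkpoint, and keep stepping; because the restored global step carries over, new checkpoints continue the old numbering. Everything in the sketch below (`build_graph`, the toy loss, the paths) is a placeholder, not code from the linked repository:

```python
import tensorflow as tf  # TF 1.x


def build_graph():
    """Toy stand-in for the real seq2seq graph; replace with your own model."""
    global_step = tf.train.get_or_create_global_step()
    x = tf.Variable(0.0)
    loss = tf.square(x - 3.0)
    train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
        loss, global_step=global_step)
    return train_op, global_step


model_dir = "./model_checkpoints"  # hypothetical
train_op, global_step = build_graph()
saver = tf.train.Saver(max_to_keep=5)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    latest = tf.train.latest_checkpoint(model_dir)
    if latest:
        saver.restore(sess, latest)  # e.g. picks up again at step 420,000
    start = sess.run(global_step)
    extra_steps = 1000  # e.g. 300,000 in the questions above
    while sess.run(global_step) < start + extra_steps:
        sess.run(train_op)  # feed the new / combined training data here
    tf.gfile.MakeDirs(model_dir)
    saver.save(sess, model_dir + "/model.ckpt", global_step=global_step)
```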