Closed LinuxBeginner closed 3 years ago
Hi, I am training the model in Colab, so I will need to resume training. I used the following training script to train the model:
THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,gpuarray.preallocate=0.8 python $nematus_home/nematus/train.py \
    --model $working_dir/model.npz \
    --datasets $data_dir/train.bpe.$src $data_dir/train.bpe.$trg \
    --valid_datasets $data_dir/dev.bpe.$src $data_dir/dev.bpe.$trg \
    --dictionaries $data_dir/train.bpe.$src.json $data_dir/train.bpe.$trg.json \
    --valid_script $script_dir/validate.sh \
    --dim_word 512 \
    --dim 1024 \
    --lrate 0.0001 \
    --optimizer adam \
    --maxlen 50 \
    --batch_size 80 \
    --valid_batch_size 40 \
    --validFreq 10000 \
    --dispFreq 1000 \
    --saveFreq 10000 \
    --sampleFreq 10000 \
    --tie_decoder_embeddings \
    --layer_normalisation \
    --dec_base_recurrence_transition_depth 8 \
    --enc_recurrence_transition_depth 4
After a while, the training stopped because I hit my daily 12-hour limit in Colab. I was able to save only one model (model.npz.data-00000-of-00001). My working directory contains:
enmn.bpe
model.npz.data-00000-of-00001
model.npz.index
model.npz.json
model.npz.meta
model.npz.progress.json
truecase-model.en
truecase-model.mn
As per the comments here, the parameter --model $working_dir/model.npz should do the work.
However, without any changes to the script, when I ran it again, the training started over from epoch 0.
I also tried adding --reload $working_dir/model.npz.data-00000-of-00001 to the script, but it gave me an error.
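Incidentally, the model.npz.data-00000-of-00001 name looks like a single shard of a TensorFlow-style checkpoint: model.npz.index, model.npz.meta and model.npz.data-* appear to share the common prefix model.npz. A minimal sketch (my assumption about the naming scheme, not something from the Nematus docs) of recovering that prefix from the shard filename:

```shell
# Assumption: one logical checkpoint is stored as a prefix plus several
# files (prefix.index, prefix.meta, prefix.data-NNNNN-of-NNNNN), so flags
# that take a model path would point at the prefix, not at one shard.
shard="model.npz.data-00000-of-00001"
prefix="${shard%.data-*}"   # strip the trailing .data-NNNNN-of-NNNNN suffix
echo "$prefix"
```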
Which parameter should I use to continue training from the last checkpoint?