marian-nmt / marian

Fast Neural Machine Translation in C++
https://marian-nmt.github.io

How to continue training from the last checkpoint in nematus #344

Closed LinuxBeginner closed 3 years ago

LinuxBeginner commented 3 years ago

Hi, I am training the model in Colab, so I will need to resume training after the session ends. I used the following script to train the model:

THEANO_FLAGS=mode=FAST_RUN,floatX=float32,device=$device,gpuarray.preallocate=0.8 python $nematus_home/nematus/train.py \
    --model $working_dir/model.npz \
    --datasets $data_dir/train.bpe.$src $data_dir/train.bpe.$trg \
    --valid_datasets $data_dir/dev.bpe.$src $data_dir/dev.bpe.$trg \
    --dictionaries $data_dir/train.bpe.$src.json $data_dir/train.bpe.$trg.json \
    --valid_script $script_dir/validate.sh \
    --dim_word 512 \
    --dim 1024 \
    --lrate 0.0001 \
    --optimizer adam \
    --maxlen 50 \
    --batch_size 80 \
    --valid_batch_size 40 \
    --validFreq 10000 \
    --dispFreq 1000 \
    --saveFreq 10000 \
    --sampleFreq 10000 \
    --tie_decoder_embeddings \
    --layer_normalisation \
    --dec_base_recurrence_transition_depth 8 \
    --enc_recurrence_transition_depth 4

After a while, the training stopped because I hit my daily 12-hour limit on Colab. I was able to save only one model (model.npz.data-00000-of-00001):

enmn.bpe                       model.npz.json           truecase-model.en
model.npz.data-00000-of-00001  model.npz.meta           truecase-model.mn
model.npz.index                model.npz.progress.json
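
As an aside (my own assumption, not something confirmed by the Nematus docs): the `.data-00000-of-00001`, `.index`, and `.meta` files above look like the parts of a single TensorFlow checkpoint, which is normally addressed by its common prefix (`model.npz`) rather than by any one of those files. The prefix can be recovered from a shard name like this:

```shell
# Illustrative only: strip the ".data-…" shard suffix to get the
# checkpoint prefix that TensorFlow-style tools expect.
ckpt_file="model.npz.data-00000-of-00001"
ckpt_prefix="${ckpt_file%%.data-*}"   # removes ".data-00000-of-00001"
echo "$ckpt_prefix"                   # prints: model.npz
```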

As per the comments here, the parameter `--model $working_dir/model.npz` should do the job.

So, without any changes, when I reran the above script, training started again from epoch 0.

I also tried adding
`--reload $working_dir/model.npz.data-00000-of-00001` to the script, but it gave me an error.
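
My guess at why that errors (an assumption on my part, since the checkpoint files look TensorFlow-style): `--reload` would need the checkpoint prefix, i.e. the same path already passed to `--model`, not the `.data-00000-of-00001` shard file. A minimal sketch of what the resumed invocation might look like, with `model_dir` as a placeholder path:

```shell
# Hypothetical resume invocation (paths are placeholders): pass the
# checkpoint *prefix* to --reload, matching the value of --model.
working_dir=model_dir
echo python nematus/train.py --model "$working_dir/model.npz" --reload "$working_dir/model.npz"
```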

Which parameter should I use to continue training from the last checkpoint?