nusnlp / mlconvgec2018

Code and model files for the paper: "A Multilayer Convolutional Encoder-Decoder Neural Network for Grammatical Error Correction" (AAAI-18).
GNU General Public License v3.0

Error with checkpoint #4

Closed michellegiang closed 6 years ago

michellegiang commented 6 years ago

Hi,

I have this problem when running python generate.py with the old version of fairseq-py (I used a model from https://github.com/nusnlp/mlconvgec2018 and trained with pre-trained word embeddings) and a new version of PyTorch. Could you let me know how to solve it? (It seems similar to #52.)

Thank you for your support

michelle:~$ screen -r 117073.pts-19

++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/home/michelle/mlc/mlconvgec2018
+++ DATA_DIR=/home/michelle/mlc/mlconvgec2018/data
+++ MODEL_DIR=/home/michelle/mlc/mlconvgec2018/models
+++ SCRIPTS_DIR=/home/michelle/mlc/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/home/michelle/mlc/mlconvgec2018/software
++ '[' 4 -ge 4 ']'
++ input_file=/home/michelle/mlc/mlconvgec2018/data/dev.tok.src
++ output_dir=/home/michelle/mlc/mlconvgec2018/result/augment_y_1
++ device=2
++ model_path=/home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000
++ '[' 4 -eq 6 ']'
++ '[' -d /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000 ']'
+++ tr '\n' ' '
+++ sed 's| ([^$])| --path \1|g'
+++ ls /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint11.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint13.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint2.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint4.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint5.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint7.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint9.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_last.pt
++ models='/home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint11.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint13.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint2.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint4.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint5.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint7.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint9.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_last.pt'
++ FAIRSEQPY=/home/michelle/mlc/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/home/michelle/mlc/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p /home/michelle/mlc/mlconvgec2018/result/augment_y_1
++ /home/michelle/mlc/mlconvgec2018/scripts/apply_bpe.py -c /home/michelle/mlc/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=2
++ python3.6 /home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint10.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint11.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint12.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint13.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint1.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint2.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint3.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint4.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint5.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint6.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint7.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint8.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint9.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_last.pt --beam 12 --nbest 12 --interactive --workers 12 /home/michelle/mlc/mlconvgec2018/models/data_bin
Traceback (most recent call last):
  File "/home/michelle/anaconda3/envs/michelle/lib/python3.6/site-packages/torch/nn/modules/module.py", line 514, in load_state_dict
    own_state[name].copy_(param)
RuntimeError: inconsistent tensor size, expected tensor [30004 x 500] and src [28799 x 500] to have the same number of elements, but got 15002000 and 14399500 elements respectively at /opt/conda/conda-bld/pytorch_1518243271935/work/torch/lib/TH/generic/THTensorCopy.c:86

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py", line 167, in <module>
    main()
  File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py", line 41, in main
    models, dataset = utils.load_ensemble_for_inference(args.path, args.data)
  File "/home/michelle/mlc/mlconvgec2018/software/fairseq-py/fairseq/utils.py", line 128, in load_ensemble_for_inference
    model.load_state_dict(state['model'])
  File "/home/michelle/anaconda3/envs/michelle/lib/python3.6/site-packages/torch/nn/modules/module.py", line 519, in load_state_dict
    .format(name, own_state[name].size(), param.size()))
RuntimeError: While copying the parameter named encoder.embed_tokens.weight, whose dimensions in the model are torch.Size([30004, 500]) and whose dimensions in the checkpoint are torch.Size([28799, 500]).

shamilcm commented 6 years ago

The run.sh script only works for the released pre-trained models. If you are going to generate using your own trained model, run with the correct bin directory, i.e. mlconvgec2018/training/processed/bin instead of

/home/michelle/mlc/mlconvgec2018/models/data_bin

What is the number of lines in mlconvgec2018/training/processed/bin/dict.src.txt and mlconvgec2018/training/processed/bin/dict.trg.txt ?
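(A quick way to count them, using the standard wc utility:)

    wc -l mlconvgec2018/training/processed/bin/dict.src.txt mlconvgec2018/training/processed/bin/dict.trg.txt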

Also, when you run generate.py on a model that you trained, just use a single model (checkpoint_best.pt would be a good choice). It is not good to include all checkpoints from 1 to 10 while decoding. If you want to do ensembling, either choose the last 2-4 checkpoints, or run training multiple times from the beginning and use the checkpoint_best.pt files from all training runs.
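For illustration, a single-model decoding command along the lines of what run.sh builds would look roughly like this (a sketch: the checkpoint and bin paths follow the trace above, input.bpe.src and output.nbest.txt are placeholders, and it assumes --interactive reads the BPE-segmented source sentences from standard input, as in run.sh):

    CUDA_VISIBLE_DEVICES=2 python3.6 $FAIRSEQPY/generate.py --no-progress-bar \
        --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt \
        --beam 12 --nbest 12 --interactive --workers 12 \
        /home/michelle/mlc/mlconvgec2018/training/processed/bin < input.bpe.src > output.nbest.txt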

michellegiang commented 6 years ago

Hi Shamil,

Thank you for your support !

  1. If I generate using my own trained model, I just use your run.sh but replace "$MODEL_DIR/data_bin" with my binary directory (e.g. mlconvgec2018/training/processed/bin), right?
  2. The number of lines in dict.src.txt is 28795 and dict.trg.txt is 28980
  3. My development file is the CoNLL 2013 test file, so should I use generate.py or generate.py -i? (It seems generate.py works with binarized files and generate.py -i works with raw files, and my CoNLL 2013 test file is binarized.)
  4. I don't know why, when I use a single model, I get the error below (while it works well if I replace "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt" with "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed").

./run.sh "/home/michelle/mlc/mlconvgec2018/data/test/conll14st-test/conll14st-test.tok.src" "/home/michelle/mlc/test" 2 "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt"

++ source paths.sh
+++++ dirname paths.sh
++++ cd .
++++ pwd
+++ BASE_DIR=/home/michelle/mlc/mlconvgec2018
+++ DATA_DIR=/home/michelle/mlc/mlconvgec2018/data
+++ MODEL_DIR=/home/michelle/mlc/mlconvgec2018/models
+++ SCRIPTS_DIR=/home/michelle/mlc/mlconvgec2018/scripts
+++ SOFTWARE_DIR=/home/michelle/mlc/mlconvgec2018/software
++ '[' 4 -ge 4 ']'
++ input_file=/home/michelle/mlc/mlconvgec2018/data/test/conll14st-test/conll14st-test.tok.src
++ output_dir=/home/michelle/mlc/test
++ device=2
++ model_path=/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt
++ '[' 4 -eq 6 ']'
++ '[' -d /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt ']'
++ '[' -f /home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt ']'
++ model=/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model1.pt
++ FAIRSEQPY=/home/michelle/mlc/mlconvgec2018/software/fairseq-py
++ NBEST_RERANKER=/home/michelle/mlc/mlconvgec2018/software/nbest-reranker
++ beam=12
++ nbest=12
++ threads=12
++ mkdir -p /home/michelle/mlc/test
++ /home/michelle/mlc/mlconvgec2018/scripts/apply_bpe.py -c /home/michelle/mlc/mlconvgec2018/models/bpe_model/train.bpe.model
++ CUDA_VISIBLE_DEVICES=2
++ python3.6 /home/michelle/mlc/mlconvgec2018/software/fairseq-py/generate.py --no-progress-bar --path --beam 12 --nbest 12 --interactive --workers 12 /home/michelle/mlc/mlconvgec2018/models/data_bin
usage: generate.py [-h] [--no-progress-bar] [--log-interval N] [--seed N] --path FILE [-s SRC] [-t TARGET] [-j N] [--max-positions N] [-i] [--batch-size N] [--gen-subset SPLIT] [--beam N] [--nbest N] [--max-len-a N] [--max-len-b N] [--remove-bpe] [--no-early-stop] [--unnormalized] [--cpu] [--no-beamable-mm] [--lenpen LENPEN] [--unk-replace-dict UNK_REPLACE_DICT] DIR
generate.py: error: argument --path: expected one argument

shamilcm commented 6 years ago

If I generate using my own trained model, I just use your run.sh but replace "$MODEL_DIR/data_bin" with my binary directory (e.g. mlconvgec2018/training/processed/bin), right?

Yes. If you generate using a model that you trained (which uses the bin files that you generated with preprocessing), then you have to use the same binary directory for decoding. This is because the dictionary files used to train the model must match the dictionary files used for generation.
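As a sanity check (an illustrative sketch; it assumes the released data_bin directory also contains dict.src.txt and dict.trg.txt), you can diff the dictionaries of the two bin directories; any difference explains the size mismatch in the error:

    diff mlconvgec2018/training/processed/bin/dict.src.txt /home/michelle/mlc/mlconvgec2018/models/data_bin/dict.src.txt
    diff mlconvgec2018/training/processed/bin/dict.trg.txt /home/michelle/mlc/mlconvgec2018/models/data_bin/dict.trg.txt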

The number of lines in dict.src.txt is 28795 and dict.trg.txt is 28980

This indicates that the training data you used contains fewer than 30K unique tokens. It would be better if you used a lower vocabulary size for your model so that it can learn a better estimate for the UNK token. Otherwise, every token that you see during training will be a known vocabulary word. The data used in the paper is a concatenation of Lang8 v2 and NUCLE, which has more than 30K unique BPE tokens. Hence, a vocab size of 30K was okay.
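For example, re-binarizing with a smaller vocabulary might look roughly like the following (a sketch: the flag names are the usual fairseq-py preprocess.py options and the train/valid/test prefixes are placeholders; check preprocess.py --help in your copy):

    # rebuild the binarized data with a ~28K vocabulary (below the 28795 unique tokens)
    python3.6 $FAIRSEQPY/preprocess.py --source-lang src --target-lang trg \
        --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
        --nwordssrc 28000 --nwordstgt 28000 \
        --destdir mlconvgec2018/training/processed/bin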

My development file is the CoNLL 2013 test file, so should I use generate.py or generate.py -i? (It seems generate.py works with binarized files and generate.py -i works with raw files, and my CoNLL 2013 test file is binarized.)

If you want to decode a file that is already in your bin/ directory, i.e. the binarized test or valid file, then you can use generate.py without the -i flag. However, if you want to decode from a raw text file, as is done in the run.sh script, use generate.py -i.
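For instance, decoding the binarized test split directly might look like this (a sketch using the --gen-subset option shown in the usage message above; it assumes your bin/ directory contains a binarized test set, and the output file name is a placeholder):

    python3.6 $FAIRSEQPY/generate.py --no-progress-bar \
        --path /home/michelle/mlc/mlconvgec2018/training/models/mlconv_embed/model1000/checkpoint_best.pt \
        --beam 12 --nbest 12 --gen-subset test \
        /home/michelle/mlc/mlconvgec2018/training/processed/bin > test.decoded.txt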

I don't know why, when I use a single model, I get the error below (while it works well if I replace "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed/model4.pt" with "/home/michelle/mlc/mlconvgec2018/models/mlconv_embed").

Sorry, there was a typo in the run.sh script. I have now fixed it here https://github.com/nusnlp/mlconvgec2018/commit/f7b9fca293f61f1d42915542d90e295b395e0531. Thanks.

michellegiang commented 6 years ago

Thank you very much!

So could you please let me know whether I need to increase my number of unique tokens to 30K? I mean, how can I solve the "RuntimeError: inconsistent tensor size, expected tensor [30004 x 500] and src [28799 x 500]"?

This indicates that the training data you used contains fewer than 30K unique tokens. It would be better if you used a lower vocabulary size for your model so that it can learn a better estimate for the UNK token. Otherwise, every token that you see during training will be a known vocabulary word. The data used in the paper is a concatenation of Lang8 v2 and NUCLE, which has more than 30K unique BPE tokens. Hence, a vocab size of 30K was okay.

shamilcm commented 6 years ago

So could you please let me know whether I need to increase my number of unique tokens to 30K? I mean, how can I solve the "RuntimeError: inconsistent tensor size, expected tensor [30004 x 500] and src [28799 x 500]"?

I believe this problem is due to using an incorrect bin/ directory. Isn't the problem solved when you use the bin directory within training/?

shamilcm commented 6 years ago

Is the issue solved?

michellegiang commented 6 years ago

Hi Shamil,

Thank you for your support. It is solved. I've finished training and testing successfully.

Regards, Michelle