Open mrzpx opened 5 years ago
Hello, do you know how to download the lang-8-en-1.0.zip data?
Have you found it?
Oh, I have found the lang-8 data myself. I meant the pre-trained WikiText-103 language model. Thank you!
@mrzpx Hi, I downloaded the pre-trained WikiText-103 language model from https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.md and tried to use it to test the model, but I got this error:
Missing key(s) in state_dict: "decoder.embed_tokens.weight", "decoder.adaptive_softmax.head.weight".
Unexpected key(s) in state_dict: "decoder.embed_tokens._float_tensor", "decoder.embed_tokens.embeddings.0.0.weight", "decoder.embed_tokens.embeddings.0.1.weight", "decoder.embed_tokens.embeddings.1.0.weight", "decoder.embed_tokens.embeddings.1.1.weight", "decoder.embed_tokens.embeddings.2.0.weight", "decoder.embed_tokens.embeddings.2.1.weight", "decoder.adaptive_softmax.head._float_tensor", "decoder.adaptive_softmax.head.word_proj.weight", "decoder.adaptive_softmax.head.class_proj.weight".
size mismatch for decoder.adaptive_softmax.tail.0.0.weight: copying a param of torch.Size([256, 1024]) from checkpoint, where the shape is torch.Size([1024, 256]) in current model.
size mismatch for decoder.adaptive_softmax.tail.1.0.weight: copying a param of torch.Size([64, 1024]) from checkpoint, where the shape is torch.Size([1024, 64]) in current model.
Did you run into this error too?
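For anyone debugging the same message: "Missing" keys are parameters the current model expects but the checkpoint lacks, and "Unexpected" keys are the reverse. Both usually point to a checkpoint saved with a different model definition or library version. Below is a minimal sketch of that comparison in plain Python, using a few key names copied from the error above. With a real checkpoint, the key lists would probably come from something like `torch.load(path, map_location="cpu")["model"].keys()` and `model.state_dict().keys()`; the `"model"` entry is an assumption about the checkpoint layout, so check it against your file.

```python
def diff_state_dict_keys(checkpoint_keys, model_keys):
    """Return (missing, unexpected) key lists, following PyTorch's wording."""
    checkpoint_keys = set(checkpoint_keys)
    model_keys = set(model_keys)
    # Keys the current model expects but the checkpoint does not contain.
    missing = sorted(model_keys - checkpoint_keys)
    # Keys stored in the checkpoint that the current model does not define.
    unexpected = sorted(checkpoint_keys - model_keys)
    return missing, unexpected


# A subset of the key names from the error message in this thread.
model_keys = [
    "decoder.embed_tokens.weight",
    "decoder.adaptive_softmax.head.weight",
    "decoder.adaptive_softmax.tail.0.0.weight",
]
checkpoint_keys = [
    "decoder.embed_tokens.embeddings.0.0.weight",
    "decoder.adaptive_softmax.head.word_proj.weight",
    "decoder.adaptive_softmax.tail.0.0.weight",
]

missing, unexpected = diff_state_dict_keys(checkpoint_keys, model_keys)
print("missing:", missing)
print("unexpected:", unexpected)
```

Keys that appear in both lists with different shapes (the "size mismatch" lines) are not caught by this comparison; those need a shape check on the loaded tensors.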
@mrzpx @rgcottrell Hello, how should I pre-train the WikiText-103 language model? I hope to hear from you.
Maybe you can try training a 5-gram language model with kenlm on an en-wiki corpus.
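A rough sketch of what that could look like, assuming kenlm's `lmplz` binary is built and on your PATH. `lmplz` reads text from standard input and writes an ARPA file to standard output, and `-o` sets the n-gram order. The file paths in the commented usage are hypothetical, and the corpus is assumed to be tokenized already, one sentence per line. Note that this produces an n-gram model, not a drop-in replacement for the fairseq WikiText-103 checkpoint, so the scoring code would need to be adapted.

```python
import subprocess


def build_lmplz_command(order=5, lmplz="lmplz"):
    """Build the argv for kenlm's lmplz with the given n-gram order."""
    return [lmplz, "-o", str(order)]


def train_ngram_lm(corpus_path, arpa_path, order=5, lmplz="lmplz"):
    """Pipe the corpus into lmplz and write the resulting ARPA model."""
    with open(corpus_path, "rb") as src, open(arpa_path, "wb") as dst:
        subprocess.run(
            build_lmplz_command(order, lmplz),
            stdin=src,
            stdout=dst,
            check=True,  # raise if lmplz exits with a non-zero status
        )


print(build_lmplz_command())

# Hypothetical usage (paths are placeholders, not from this project):
# train_ngram_lm("enwiki.tok.txt", "enwiki.5gram.arpa")
```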
@young-zonglin Did you do it this way, and did it work?
I just used the pre-trained model from the Fairseq repo. It's possible that it has since been updated for PyTorch 1.0 and the new model distribution formats, and is no longer compatible with the older version used when the project was started.
Here's the link to the wiki model from the GitHub repo history, I think:
https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2
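If it helps, here is a minimal sketch for fetching and unpacking that archive with the Python standard library. The target directory `../data-bin/wiki103` is taken from the batch files in this repo; the file names inside the archive are not known here, so you may need to rename the checkpoint to match whatever `--lang-model-path` points at.

```python
import tarfile
import urllib.request
from pathlib import Path

MODEL_URL = "https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2"


def download(url, dest):
    """Download url to dest, creating parent directories as needed."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    urllib.request.urlretrieve(url, str(dest))
    return dest


def extract_tar_bz2(archive, target_dir):
    """Extract a .tar.bz2 archive and return the names of the extracted files."""
    target_dir = Path(target_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    with tarfile.open(str(archive), "r:bz2") as tar:
        tar.extractall(str(target_dir))
    return sorted(p.name for p in target_dir.rglob("*") if p.is_file())


# Usage (not run here because it downloads a large file):
# archive = download(MODEL_URL, "../data-bin/wiki103/wiki103_fconv_lm.tar.bz2")
# print(extract_tar_bz2(archive, "../data-bin/wiki103"))
```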
@rgcottrell
Thanks for your reply. There is another problem.
When I ran the following command:
python3 ./generate.py \
../data-bin/lang-8-fairseq \
--path ../checkpoints/lang-8-fairseq-cnn/checkpoint_best.pt \
--batch-size 128 \
--beam 5 \
--nbest 1 \
an error occurred:
args: {'path': None, 'data': None, 'task': 'language_modeling', 'raw_text': True, 'sample_break_mode': 'eos', 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'num_shards': 1, 'shard_id': 0, 'max_tokens': None, 'max_sentences': None, 'tokens_per_sample': 1024}
args.data: None
Traceback (most recent call last):
File "./generate.py", line 237, in
It seems that I should provide the language model dictionary path. However, when I replaced "./generate.py" with "../fairseq/generate.py", it worked. Should I provide the language model dictionary path, or should I replace "./generate.py" with "../fairseq/generate.py"?
The custom ./generate.py script is used to add the language model and fluency scorer to the output, so you will want to use it to get the best version of the results. If you look at some of the other batch files, you will see some things like:
generate-lang8-cnn-rawtext.bat
python .\generate.py^
..\test\lang-8^
--path ..\checkpoints\lang-8-fairseq-cnn\checkpoint_best.pt^
--batch-size 128^
--beam 5^
--nbest 12^
--lang-model-data ..\data-bin\wiki103^
--lang-model-path ..\data-bin\wiki103\wiki103.pt^
--raw-text^
--source-lang en^
--target-lang gec
So once you have the wiki language model installed, try adding the --lang-model-data and --lang-model-path options and see if that works.
Sorry, but it's been a while since I ran this code and I'm having some trouble remembering exactly how things worked.
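For the Linux command earlier in the thread, the same idea can be sketched as below: the original arguments plus the two language model options. The paths are copied from the posts above and will need adjusting to your layout. The batch file also passes --raw-text, --source-lang and --target-lang; they are left out here on the assumption that the data in ../data-bin/lang-8-fairseq is already binarized.

```python
import shlex


def build_generate_command(data_dir, checkpoint, lm_data, lm_path,
                           batch_size=128, beam=5, nbest=1,
                           python="python3", script="./generate.py"):
    """Build the argv for the project's custom generate.py script."""
    return [
        python, script, data_dir,
        "--path", checkpoint,
        "--batch-size", str(batch_size),
        "--beam", str(beam),
        "--nbest", str(nbest),
        # The two options the custom script needs for the language model.
        "--lang-model-data", lm_data,
        "--lang-model-path", lm_path,
    ]


cmd = build_generate_command(
    data_dir="../data-bin/lang-8-fairseq",
    checkpoint="../checkpoints/lang-8-fairseq-cnn/checkpoint_best.pt",
    lm_data="../data-bin/wiki103",
    lm_path="../data-bin/wiki103/wiki103.pt",
)
# Print a copy-pasteable shell command; pass cmd to subprocess.run to execute it.
print(shlex.join(cmd))
```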
@rgcottrell That's very kind of you. I added the --lang-model-data and --lang-model-path options and that worked: I could get translations. But when it calculated the GLEU score, an error occurred in gleu.py.
Traceback (most recent call last):
File "./generate.py", line 237, in
Could you please tell me how to solve it? Thanks a lot.
I'm not sure what the issue is, but it could be that we didn't fix generate.py after finishing the GLEU score. You could try just commenting out the calls here. The GLEU score is only used to benchmark results against other papers and isn't part of the core algorithm.
You could also take a look at interactive.py, which was the final result of the project and what we used to show off the results with both a web-based and command line tool. It's a little smarter about how GLEU score processing is invoked.
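If you would rather not delete the calls, one alternative is to wrap the scoring step so that a failure is logged and skipped instead of aborting generation. This is only a sketch of that pattern, not how generate.py or interactive.py handle it; `broken_gleu` below is a hypothetical stand-in for whatever GLEU function fails on your setup.

```python
import sys


def score_or_skip(score_fn, *args, **kwargs):
    """Call score_fn; on any error, report it on stderr and return None."""
    try:
        return score_fn(*args, **kwargs)
    except Exception as exc:
        print(f"GLEU scoring skipped: {exc!r}", file=sys.stderr)
        return None


def broken_gleu(hypotheses, references):
    """Hypothetical scorer that fails with an out-of-range index."""
    return references[len(hypotheses)]  # raises IndexError


score = score_or_skip(broken_gleu, ["a corrected sentence"], ["a reference"])
if score is None:
    print("continuing without a GLEU score")
```

Since the GLEU score is only a benchmark figure, the generated corrections are unaffected by skipping it.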
@wulouzhu I also had an array index out-of-bounds problem. How did you solve it?
Where do I get this pre-trained model?
Hi, could you help me? When I run the command ./generate-lang8-cnn.bat, it reports that the files test.label.en.txt and test.label.label.gec.txt are missing. How do I fix it?
@WangQi1024 These are dictionary files. Have you followed the steps in "Preparing Data" and "Pre-process Data"? These two steps will create the dictionary files for you.
Where do I get this pre-trained model?