rgcottrell / pytorch-human-performance-gec

A PyTorch implementation of "Reaching Human-level Performance in Automatic Grammatical Error Correction: An Empirical Study"
Apache License 2.0

Pre-trained WikiText-103 language model #3

Open mrzpx opened 5 years ago

mrzpx commented 5 years ago

Where do I get this pre-trained model?

wulouzhu commented 5 years ago

Hello, do you know how to download the lang-8-en-1.0.zip data?

wulouzhu commented 5 years ago

have you found it?

mrzpx commented 5 years ago

https://sites.google.com/site/naistlang8corpora/

wulouzhu commented 5 years ago

Oh, I have already found the lang-8 data myself. I meant the pre-trained WikiText-103 language model. Thank you!

wulouzhu commented 5 years ago

@mrzpx Hi, I downloaded the pre-trained WikiText-103 language model from https://github.com/pytorch/fairseq/blob/master/examples/language_model/README.md and used it to test the model, but I got this error:

Missing key(s) in state_dict: "decoder.embed_tokens.weight", "decoder.adaptive_softmax.head.weight".
Unexpected key(s) in state_dict: "decoder.embed_tokens._float_tensor", "decoder.embed_tokens.embeddings.0.0.weight", "decoder.embed_tokens.embeddings.0.1.weight", "decoder.embed_tokens.embeddings.1.0.weight", "decoder.embed_tokens.embeddings.1.1.weight", "decoder.embed_tokens.embeddings.2.0.weight", "decoder.embed_tokens.embeddings.2.1.weight", "decoder.adaptive_softmax.head._float_tensor", "decoder.adaptive_softmax.head.word_proj.weight", "decoder.adaptive_softmax.head.class_proj.weight".
size mismatch for decoder.adaptive_softmax.tail.0.0.weight: copying a param of torch.Size([256, 1024]) from checkpoint, where the shape is torch.Size([1024, 256]) in current model.
size mismatch for decoder.adaptive_softmax.tail.1.0.weight: copying a param of torch.Size([64, 1024]) from checkpoint, where the shape is torch.Size([1024, 64]) in current model.

Did you run into this error too?
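An error like this lists three kinds of disagreement between a checkpoint and the model that tries to load it: keys the model expects but the checkpoint lacks, keys the checkpoint has but the model does not know, and keys present on both sides with different shapes. The sketch below reproduces that comparison on plain dictionaries so the mismatch can be inspected before loading. The function name `diff_state_dicts` is hypothetical, the two shapes are copied from the error message, and the empty tuples are placeholders for shapes the message does not show.

```python
def diff_state_dicts(checkpoint_shapes, model_shapes):
    """Compare two {parameter name: shape} mappings.

    Returns (missing, unexpected, mismatched), mirroring the three
    sections of the error message quoted in the comment above.
    """
    # Keys the model expects but the checkpoint does not contain.
    missing = sorted(k for k in model_shapes if k not in checkpoint_shapes)
    # Keys the checkpoint contains but the model does not define.
    unexpected = sorted(k for k in checkpoint_shapes if k not in model_shapes)
    # Keys present on both sides whose shapes differ.
    mismatched = sorted(
        (k, tuple(checkpoint_shapes[k]), tuple(model_shapes[k]))
        for k in checkpoint_shapes.keys() & model_shapes.keys()
        if tuple(checkpoint_shapes[k]) != tuple(model_shapes[k])
    )
    return missing, unexpected, mismatched


# Toy input taken from the error message. With PyTorch installed, the real
# mappings could be built as {k: tuple(v.shape) for k, v in sd.items()},
# where sd is the checkpoint's state dict or model.state_dict().
checkpoint = {
    "decoder.embed_tokens.embeddings.0.0.weight": (),  # shape not shown in the error
    "decoder.adaptive_softmax.tail.0.0.weight": (256, 1024),
}
model = {
    "decoder.embed_tokens.weight": (),  # shape not shown in the error
    "decoder.adaptive_softmax.tail.0.0.weight": (1024, 256),
}

missing, unexpected, mismatched = diff_state_dicts(checkpoint, model)
print("missing:", missing)
print("unexpected:", unexpected)
print("mismatched:", mismatched)
```

When all three lists are non-empty, as here, the checkpoint and the model code most likely describe different architectures or library versions, so renaming a few keys is unlikely to be enough.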

wulouzhu commented 5 years ago

@mrzpx @rgcottrell Hello, how should I pre-train the WikiText-103 language model? I hope to receive your reply.

young-zonglin commented 5 years ago

Maybe you can try training a 5-gram language model with KenLM on an en-wiki corpus.

wulouzhu commented 5 years ago

@young-zonglin Did you do it this way? And did it work?

rgcottrell commented 5 years ago

I just used the pre-trained model from the Fairseq repo. It's possible that it has since been updated for PyTorch 1.0 and the new model distribution formats, and that it no longer works with the older version used when the project was started.

rgcottrell commented 5 years ago

Here's the link to the wiki model from the GitHub repo history, I think:

https://dl.fbaipublicfiles.com/fairseq/models/wiki103_fconv_lm.tar.bz2

wulouzhu commented 5 years ago

@rgcottrell Thanks for your reply. There is another problem. When I ran the following command:

python3 ./generate.py \
    ../data-bin/lang-8-fairseq \
    --path ../checkpoints/lang-8-fairseq-cnn/checkpoint_best.pt \
    --batch-size 128 \
    --beam 5 \
    --nbest 1

this error occurred:

args: {'path': None, 'data': None, 'task': 'language_modeling', 'raw_text': True, 'sample_break_mode': 'eos', 'output_dictionary_size': -1, 'self_target': False, 'future_target': False, 'past_target': False, 'num_shards': 1, 'shard_id': 0, 'max_tokens': None, 'max_sentences': None, 'tokens_per_sample': 1024}
args.data: None
Traceback (most recent call last):
  File "./generate.py", line 237, in <module>
    main(args)
  File "./generate.py", line 94, in main
    fluency_scorer = FluencyScorer(args.lang_model_path, args.lang_model_data)
  File "/home/nlp/WJF/pytorch-human-performance-gec-master/fairseq-scripts/fluency_scorer.py", line 58, in __init__
    self.task = tasks.setup_task(self.args)
  File "/home/nlp/WJF/pytorch-human-performance-gec-master/fairseq/fairseq/tasks/__init__.py", line 19, in setup_task
    return TASK_REGISTRY[args.task].setup_task(args)
  File "/home/nlp/WJF/pytorch-human-performance-gec-master/fairseq/fairseq/tasks/language_modeling.py", line 92, in setup_task
    dictionary = Dictionary.load(os.path.join(args.data, 'dict.txt'))
  File "/usr/lib64/python3.6/posixpath.py", line 80, in join
    a = os.fspath(a)
TypeError: expected str, bytes or os.PathLike object, not NoneType

It seems that I need to provide the language model dictionary path. However, when I replaced "./generate.py" with "../fairseq/generate.py", it worked. Should I provide the language model dictionary path, or should I replace "./generate.py" with "../fairseq/generate.py"?
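The final TypeError comes from passing None to os.path.join: the printed args show 'data': None, so the language model data directory was never set. The sketch below reproduces that failure and shows a guard that fails earlier with a clearer message. The helper name `dictionary_path` and its error text are hypothetical and are not part of the repository.

```python
import os


def dictionary_path(lang_model_data):
    """Return the path to dict.txt inside the language model data directory.

    Raises a ValueError with a readable message instead of letting
    os.path.join fail on None deep inside the task setup.
    """
    if lang_model_data is None:
        raise ValueError(
            "no language model data directory given; pass --lang-model-data"
        )
    return os.path.join(lang_model_data, "dict.txt")


# Unguarded call: this is the failure shown in the traceback.
try:
    os.path.join(None, "dict.txt")
except TypeError as err:
    print("unguarded:", err)

# Guarded call: same missing value, but the message names the missing option.
try:
    dictionary_path(None)
except ValueError as err:
    print("guarded:", err)

# With a directory supplied, the path is built as usual.
print(dictionary_path(os.path.join("..", "data-bin", "wiki103")))
```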

rgcottrell commented 5 years ago

The custom ./generate.py script is used to add the language model and fluency scorer to the output, so you will want to use it to get the best version of the results. If you look at some of the other batch files, you will see some things like:

generate-lang8-cnn-rawtext.bat

python .\generate.py^
    ..\test\lang-8^
    --path ..\checkpoints\lang-8-fairseq-cnn\checkpoint_best.pt^
    --batch-size 128^
    --beam 5^
    --nbest 12^
    --lang-model-data ..\data-bin\wiki103^
    --lang-model-path ..\data-bin\wiki103\wiki103.pt^
    --raw-text^
    --source-lang en^
    --target-lang gec

So once you have the wiki language model installed, try adding the --lang-model-data and --lang-model-path options and see if that works.

Sorry, but it's been a while since I ran this code and I'm having some trouble remembering exactly how things worked.

wulouzhu commented 5 years ago

@rgcottrell So nice of you. I added the --lang-model-data and --lang-model-path options and that worked; I could get translations. But when it calculated the GLEU score, an error occurred in gleu.py:

Traceback (most recent call last):
  File "./generate.py", line 237, in <module>
    main(args)
  File "./generate.py", line 219, in main
    gleu_score = [g for g in gleu_scores][0][0] * 100;
  File "./generate.py", line 219, in <listcomp>
    gleu_score = [g for g in gleu_scores][0][0] * 100;
  File "/home/nlp/WJF/pytorch-human-performance-gec-master/fairseq-scripts/gleu.py", line 199, in run_iterations
    this_stats = [s for s in self.gleu_stats(i, r_ind=ref)]
  File "/home/nlp/WJF/pytorch-human-performance-gec-master/fairseq-scripts/gleu.py", line 199, in <listcomp>
    this_stats = [s for s in self.gleu_stats(i, r_ind=ref)]
  File "/home/nlp/WJF/pytorch-human-performance-gec-master/fairseq-scripts/gleu.py", line 122, in gleu_stats
    rlen = self.rlens[i][r_ind]
IndexError: list index out of range

Could you please tell me how to solve it? Thanks a lot.

rgcottrell commented 5 years ago

I'm not sure what the issue is, but it could be that we didn't fix generate.py after finishing the GLEU score. You could try just commenting out the calls here. The GLEU score is only used to benchmark results against other papers and isn't part of the core algorithm.

You could also take a look at interactive.py, which was the final result of the project and what we used to show off the results with both a web-based and command line tool. It's a little smarter about how GLEU score processing is invoked.
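The thread does not establish what causes the IndexError, but the failing line indexes reference lengths by sentence number and reference number, so one thing worth ruling out is a mismatch between the number of hypotheses and the number of reference sentences. The sketch below is a hypothetical pre-check, not code from the repository; `check_gleu_inputs` and its arguments are illustrative names.

```python
def check_gleu_inputs(hypotheses, reference_sets):
    """Return a list of human-readable problems with the scoring inputs.

    hypotheses: list of corrected sentences, one per source sentence.
    reference_sets: list of reference lists, one list per reference file.
    An empty result means every reference set has one sentence per hypothesis.
    """
    problems = []
    if not reference_sets:
        problems.append("no reference sets were provided")
    for ref_index, references in enumerate(reference_sets):
        if len(references) != len(hypotheses):
            problems.append(
                f"reference set {ref_index} has {len(references)} sentences, "
                f"but there are {len(hypotheses)} hypotheses"
            )
    return problems


hyps = ["He goes to school .", "She likes apples ."]
refs = [["He goes to school ."]]  # one sentence short on purpose

problems = check_gleu_inputs(hyps, refs)
if problems:
    # Skip scoring rather than crash; the score is only a benchmark figure.
    for problem in problems:
        print("skipping GLEU:", problem)
else:
    print("inputs line up; safe to compute GLEU")
```

If the counts do line up and the error persists, commenting out the GLEU calls as suggested above remains the simpler workaround.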

1228589545 commented 4 years ago

@wulouzhu I also ran into the list index out of range error. How did you solve it?

WangQi1024 commented 4 years ago

Where do I get this pre-trained model?

Hi, could you help me? When I run the command ./generate-lang8-cnn.bat, it reports that the files test.label.en.txt and test.label.label.gec.txt are missing. How do I fix this?

tianfeichen commented 4 years ago

@WangQi1024 These are dictionary files. Have you followed the "Preparing Data" and "Pre-process Data" steps? Those two steps will create the dictionary files for you.