microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by MSRA NLC team.
MIT License

How can I generate a summary from a given text with the provided pretrained model? #1

Open pragnakalpdev6 opened 4 years ago

qiweizhen commented 4 years ago

For the summarization task: 1) download the CNN/DM fine-tuned checkpoint; 2) preprocess your text with BERT tokenization (you can refer to our preprocessing scripts); 3) use fairseq-generate or fairseq-interactive to generate summaries for your given text. For fairseq-generate, you can refer to our generate scripts. With fairseq-interactive, you can easily generate a summary for typed-in text interactively. Detailed instructions can be found in the fairseq manual.
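For reference, step 3 looks roughly like the sketch below. The paths and flag values are placeholders taken from the logs later in this thread, not the exact repo script; see the repo's generate scripts and the fairseq manual for the authoritative commands.

```bash
# Batch generation over a preprocessed (binarized) test set; paths are placeholders.
fairseq-generate gigaword/processed \
  --path finetune_checkpoints/prophetnet_large_160G_cnndm_model.pt \
  --user-dir src/prophetnet --task translation_prophetnet \
  --batch-size 32 --beam 4 --lenpen 1.0 --gen-subset test > output.txt

# Interactive generation: type one BERT-tokenized source line at a time.
fairseq-interactive gigaword/processed \
  --path finetune_checkpoints/prophetnet_large_160G_cnndm_model.pt \
  --user-dir src/prophetnet --task translation_prophetnet --beam 4
```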

pragnakalpdev6 commented 4 years ago

Thank you very much for your help and prompt reply. I went through the steps you listed, but it generated an output file of about 1.6 MB. A snippet is given below.

Namespace(beam=4, bpe=None, cpu=False, criterion='cross_entropy', data='gigaword/processed', dataset_impl=None, decoding_format=None, diverse_beam_groups=-1, diverse_beam_strength=0.5, empty_cache_freq=0, force_anneal=None, fp16=False, fp16_init_scale=128, fp16_scale_tolerance=0.0, fp16_scale_window=None, gen_subset='test', iter_decode_eos_penalty=0.0, iter_decode_force_max_iter=False, iter_decode_max_iter=10, lazy_load=False, left_pad_source='True', left_pad_target='False', lenpen=1.0, load_alignments=False, log_format=None, log_interval=1000, lr_scheduler='fixed', lr_shrink=0.1, match_source_len=False, max_len_a=0, max_len_b=200, max_sentences=80, max_source_positions=1024, max_target_positions=1024, max_tokens=None, memory_efficient_fp16=False, min_len=1, min_loss_scale=0.0001, model_overrides='{}', momentum=0.99, nbest=1, no_beamable_mm=False, no_early_stop=False, no_progress_bar=False, no_repeat_ngram_size=0, num_shards=1, num_workers=4, optimizer='nag', path='/content/ProphetNet/gigaword/finetune_gigaword_checkpoints/prophetnet_large_160G_cnndm_model.pt', prefix_size=0, print_alignment=False, print_step=False, quiet=False, raw_text=False, remove_bpe=None, replace_unk=None, required_batch_size_multiple=8, results_path=None, retain_iter_history=False, sacrebleu=False, sampling=False, sampling_topk=-1, sampling_topp=-1.0, score_reference=False, seed=1, shard_id=0, skip_invalid_size_inputs_valid_test=False, source_lang=None, target_lang=None, task='translation_prophetnet', temperature=1.0, tensorboard_logdir='', threshold_loss_scale=None, tokenizer=None, truncate_source=False, unkpen=0, unnormalized=False, upsample_primary=1, user_dir='src/prophetnet', warmup_updates=0, weight_decay=0.0) | [src] dictionary: 30522 types | [tgt] dictionary: 30522 types | loaded 1951 examples from: gigaword/processed/test.src-tgt.src | loaded 1951 examples from: gigaword/processed/test.src-tgt.tgt | gigaword/processed test src-tgt 1951 examples | loading model(s) from /content/ProphetNet/gigaword/finetune_gigaword_checkpoints/prophetnet_large_160G_cnndm_model.pt S-1366 whoever says toys aren ' t educational hasn ' t been shopping lately . T-1366 think of messages toys send H-1366 -0.17664051055908203 whoever says toys aren ' t educational hasn ' t been shopping . [X_SEP] whoever says toys aren ' t educational hasn ' t been shopping lately . P-1366 -0.0529 -0.1348 -0.1275 -0.0345 -0.1054 -0.0683 -0.0950 -0.0436 -0.1044 -0.0799 -0.0715 -0.0655 -1.6817 -0.4113 -0.1809 -0.1931 -0.1575 -0.0473 -0.1059 -0.0730 -0.1007 -0.0353 -0.1043 -0.0820 -0.0557 -0.0528 -0.3674 -0.1087 -0.3816 S-1207 [UNK] [UNK] l ##ind ##ner watches her boys asleep in a sofa bed . T-1207 keeping together in tough times H-1207 -0.6526010632514954 l ##ind ##ner watches her boys asleep in a sofa bed . P-1207 -1.2911 -0.9395 -0.0799 -2.7697 -0.3412 -0.4151 -0.1209 -0.1017 -0.1471 -0.0789 -0.7264 -0.1832 -1.2892 S-1549 the caucus : [UNK] [UNK] . 1 ' s non g ##rata [UNK] [UNK] T-1549 convention notes and news H-1549 -0.581391453742981 [UNK] [UNK] . 1 ' s non g ##rata . [X_SEP] [UNK] [UNK] . 1 ' s non g ##rata . P-1549 -2.6477 -0.5196 -0.1022 -0.0812 -0.6129 -0.0908 -0.4354 -0.3716 -1.0338 -1.7429 -0.3405 -1.4666 -0.4404 -0.0920 -0.0825 -0.1598 -0.0841 -0.1997 -0.0676 -0.4207 -0.2273 -1.5711 S-111 result in a world cup group g match here on sun ##day . T-111 world cup : f ##rance 1 south k ##ore ##a 1 H-111 -2.092313528060913 result . [X_SEP] world cup group g . [X_SEP] . . . 
P-111 -3.8916 -2.6654 -1.0884 -3.8009 -0.9155 -2.1213 -0.8790 -1.5038 -0.7778 -3.7905 -2.0631 -1.1838 -2.5189 S-1259 this is the time of year when people often take golf lessons . T-1259 a lesson about lessons H-1259 -0.3009403645992279 this is the time of year when people often take golf lessons . [X_SEP] this is the time of year when people often take golf lessons . P-1259 -0.9481 -0.0945 -0.1565 -0.1362 -0.1012 -0.1434 -0.1318 -0.2480 -0.1617 -0.0740 -0.0145 -0.0571 -0.1288 -0.2310 -2.0211 -0.3342 -0.6133 -0.4769 -0.2382 -0.3035 -0.3397 -0.3627 -0.2266 -0.0938 -0.0265 -0.0476 -0.1112 -0.6042 S-1305 for j ##udi b [UNK] ##ss , a single word changed everything . T-1305 a ceremonial event evolve ##s into a wedding H-1305 -0.5971478819847107 for j ##udi b [UNK] ##ss , j ##udi b [UNK] ##ss is a single word . [X_SEP] for j ##udi b [UNK] ##ss , j ##udi b [UNK] ##ss is a single word . [X_SEP] for j ##udi b [UNK] ##ss , j ##udi b [UNK] ##ss is a single word . P-1305 -1.4792 -0.1719 -1.1859 -0.1573 -0.5069 -0.2628 -0.4174 -1.1489 -0.6239 -0.1061 -0.4192 -0.3845 -1.7359 -0.4360 -1.5304 -0.3226 -0.9057 -0.1976 -0.9780 -0.1620 -0.5133 -0.0507 -0.5228 -0.1837 -0.3661 -2.2918 -0.5154 -0.0911 -0.4235 -0.3387 -0.6043 -0.2976 -1.4205 -0.4051 -0.4760 -0.2437 -0.7788 -0.1538 -0.5351 -0.0429 -0.5296 -0.1848 -0.2981 -2.1378 -0.4968 -0.0748 -0.3894 -0.3276 -0.8252 -0.2653 -1.1904 -0.3535 -0.5116 -1.2739 S-1513 cape district attorney s ##c ##ru ##tin ##ized by grand jury [UNK] [UNK] T-1513 grand jury s ##c ##ru ##tin ##izes <[UNK]> <[UNK]> da

pragnakalpdev6 commented 4 years ago

Now I am able to summarize the text, but the problem is that the output looks like extractive summarization rather than abstractive. Maybe that is because I used the eval.py file from UniLM, since there was no file present at the given link. I need help summarizing text in an abstractive manner.

monk1337 commented 4 years ago

@qiweizhen How can I use the pre-trained model to generate questions for my own dataset? Just inference, not training or fine-tuning on my own data.

qiweizhen commented 4 years ago

@pragnakalpdev6 Actually it's trained as an abstractive summarization model. Perhaps it behaves like an extractive model because your input-text corpus differs from the CNN/DM corpus, and generating a sentence directly from your given text is easier. You may try the Gigaword fine-tuned checkpoint and see whether it works better.

qiweizhen commented 4 years ago

@monk1337 Same as discussed above, but use the SQuAD question generation fine-tuned checkpoint instead.

cddpeter commented 4 years ago

@qiweizhen The link provided for evaluating question generation is not valid. Do you have the code? Thanks

monk1337 commented 4 years ago

@cddpeter You can download it from here: https://github.com/microsoft/unilm/tree/master/unilm-v1/src/qg

monk1337 commented 4 years ago

@qiweizhen Thank you for the reply, I tried your instructions and they worked. But I want to try the pre-trained model on my raw data (I don't have labels for that), whereas in the eval file you provide both test passages and test questions for evaluation. How can I pass a corpus in a .txt file with multiple paragraphs and get the questions for each paragraph in an output file if I don't have labels (questions) for that file?

cddpeter commented 4 years ago

@monk1337 Thanks.

cddpeter commented 4 years ago

@monk1337 I got an error when I ran the evaluation file: ValueError: unsupported hash type md5. Did you have this issue when you ran it?

pragnakalpdev6 commented 4 years ago

@cddpeter No, I didn't get the error you mentioned above. I got some other errors, but somehow I managed to solve them. And thanks for your reply, @qiweizhen.

monk1337 commented 4 years ago

@qiweizhen Any suggestions on my question above about running the pre-trained model on my own unlabeled data?


pragnakalpdev6 commented 4 years ago

@qiweizhen Summarization is not working well; it generates the same text as the input. I have used both eval.py files to summarize the text, but I think something is missing in that script, or you should provide a new eval.py file.

sivakumar1604 commented 4 years ago

[quoting pragnakalpdev6's earlier comment and fairseq output snippet above]

Hi @pragnakalpdev6, as a beginner it's hard for me to understand how to use this code for abstractive summarization; everywhere it mentions translation. Could you please share some high-level steps or upload shareable code to your GitHub? Thanks.

chrisdoyleIE commented 4 years ago

@sivakumar1604 Are you a beginner to Python in general, or specifically to abstractive summarisation?

If it's just summarisation and you have a strong NLP foundation, I found it useful to adapt the PyTorch transformer tutorial to a summarisation task (https://pytorch.org/tutorials/beginner/transformer_tutorial.html).

You ask for high-level steps; what task specifically do you want to solve?

sivakumar1604 commented 4 years ago

[quoting chrisdoyleIE's reply above]

Hi, thanks for your reply. I'm working on abstractive summarization with ProphetNet, and it's not clear to me from the GitHub documentation how to do it; the examples provided seem to focus mainly on the translation task. Probably that's because I'm new to fairseq and PyTorch; I've mostly used TensorFlow with Keras until now.

I have a theoretical understanding of RNNs, LSTMs, attention, encoder-decoder networks, etc., and I have also implemented abstractive summarization with the Transformers package on the CNN/DM dataset.

If there's any notebook/blog post on how to use ProphetNet for abstractive summarization on a domain-specific dataset, that would be great.

qiweizhen commented 4 years ago

[quoting pragnakalpdev6's comment above about eval.py and QG evaluation]

To evaluate QG results, two pieces of code should be downloaded from other repos:

1) the original QG dataset repo; 2) the UniLM post-processing code.

This is because the original evaluation files are used without modification, and we recommend that users cite those repos rather than our redistributing the code.
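For example (a sketch; the clone destination is arbitrary, and the path inside the UniLM repo is the one linked earlier in this thread):

```bash
# Fetch the UniLM repo so its QG evaluation / post-processing scripts can be reused locally.
git clone https://github.com/microsoft/unilm.git
# The QG evaluation scripts referenced above live here:
ls unilm/unilm-v1/src/qg
```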

qiweizhen commented 4 years ago

[quoting the exchange above between chrisdoyleIE and sivakumar1604]

This happens because it is the way fairseq presents its results: S means the source, T means the gold target, and H means the generated hypothesis. You can extract the desired part from it manually, for example,

grep ^H $OUTPUT_FILE | cut -c 3- | sort -n | cut -f3- | sed "s/ ##//g" > cnndm/sort_hypo$SUFFIX.txt

By the way, your source input sentences do not seem to be paragraphs to summarize ...
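Along the same lines (a sketch assuming the standard fairseq-generate output format shown above; output.txt and the output file names are placeholders), the sources and gold targets can be pulled out the same way and paired with the hypotheses:

```bash
# S-/T- lines carry: tag+example id, then text; H- lines carry: tag+example id, score, then text.
# Sort by example id and strip BERT word-piece markers (" ##") to recover plain text.
grep ^S output.txt | cut -c 3- | sort -n | cut -f2- | sed "s/ ##//g" > sort_src.txt
grep ^T output.txt | cut -c 3- | sort -n | cut -f2- | sed "s/ ##//g" > sort_tgt.txt
grep ^H output.txt | cut -c 3- | sort -n | cut -f3- | sed "s/ ##//g" > sort_hypo.txt
```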

GenTxt commented 4 years ago

Hello:

Thanks for the cool repo and models. I have everything working 100% with the above-mentioned models and the cnndm/processed binary files, but I encounter a problem when trying to use 'fairseq-generate' or 'fairseq-interactive' with the default 'prophetnet_large_pretrained_160G_14epoch_model.pt'.

I would like to generate summaries from this model using input text files, without having to fine-tune a checkpoint. When trying to use the above model with cnndm/processed, it generates the following error:

{"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}

KeyError: 'best_loss'

Are there options that will enable access to this model without having to fine-tune a checkpoint from scratch?

Would the use of the --raw-text option be helpful here?

Cheers.

smita181298 commented 4 years ago

Hello @GenTxt @yuyan2do. I am also getting the same error when trying to generate a summary using the given ProphetNet model. Did you find a solution?

Traceback (most recent call last):
  File "/usr/local/bin/fairseq-generate", line 8, in <module>
    sys.exit(cli_main())
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 199, in cli_main
    main(args)
  File "/usr/local/lib/python3.6/dist-packages/fairseq_cli/generate.py", line 47, in main
    task=task,
  File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 179, in load_model_ensemble
    ensemble, args, _task = load_model_ensemble_and_task(filenames, arg_overrides, task)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 190, in load_model_ensemble_and_task
    state = load_checkpoint_to_cpu(filename, arg_overrides)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 166, in load_checkpoint_to_cpu
    state = _upgrade_state_dict(state)
  File "/usr/local/lib/python3.6/dist-packages/fairseq/checkpoint_utils.py", line 300, in _upgrade_state_dict
    {"criterion_name": "CrossEntropyCriterion", "best_loss": state["best_loss"]}
KeyError: 'best_loss'
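For reference, one unofficial workaround that has been suggested for this kind of missing-key error when loading older fairseq checkpoints (a sketch only, not verified for this particular model) is to patch the checkpoint so that fairseq's _upgrade_state_dict finds the 'best_loss' key it expects:

```bash
# Add a placeholder 'best_loss' entry to the pretrained checkpoint (file names are placeholders).
python -c "
import torch
state = torch.load('prophetnet_large_pretrained_160G_14epoch_model.pt', map_location='cpu')
state.setdefault('best_loss', 0.0)
torch.save(state, 'prophetnet_large_pretrained_160G_14epoch_model_patched.pt')
"
```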

NamraRehman commented 3 years ago

[quoting the chrisdoyleIE / sivakumar1604 exchange above]

Hi @sivakumar1604, did you find a way to use ProphetNet for abstractive summarization? I want to use this library to summarize legal court data, but I am new to NLP and need help.

umareefarooq commented 3 years ago

@sivakumar1604 check https://github.com/thatguyfig/python-text-summary/blob/master/summarizer.py