microsoft / ProphetNet

A research project for natural language generation, containing the official implementations by MSRA NLC team.
MIT License

Abstractive Summarization using ProphetNet #14

Closed harshithbelagur closed 4 years ago

harshithbelagur commented 4 years ago

I'm following these steps to summarize my document -

1) Download the CNN/DM fine-tuned checkpoint.
2) Preprocess your text with BERT tokenization; you can refer to our preprocess scripts.
3) Use fairseq-generate or fairseq-interactive to generate a summary for your given text. For fairseq-generate, you can refer to our generate scripts. With fairseq-interactive, you can easily generate a summary for typed-in text interactively. Detailed instructions can be found in the fairseq manual.

What is the --task argument for summarization?

Also, would this be sufficient if my processed input is in 2.txt?

fairseq-generate 2.txt --path "content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task summarization_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
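For reference, the generate invocation can be assembled programmatically. This is only a sketch: every path below is a placeholder, the flag values mirror the commands quoted in this thread, and the task name used is `translation_prophetnet` (which, per the replies below, is what the README specifies) rather than `summarization_prophetnet`.

```python
# Hypothetical helper that assembles the fairseq-generate argv used in this
# thread. Paths are placeholders; flag values mirror the quoted commands.
def build_generate_cmd(data_dir, checkpoint, user_dir,
                       beam=4, lenpen=1.0, batch_size=80):
    return [
        "fairseq-generate", data_dir,
        "--path", checkpoint,
        "--user-dir", user_dir,
        "--task", "translation_prophetnet",
        "--batch-size", str(batch_size),
        "--gen-subset", "test",
        "--beam", str(beam),
        "--num-workers", "4",
        "--lenpen", str(lenpen),
    ]

cmd = build_generate_cmd("cnndm/processed",
                         "prophetnet_large_160G_cnndm_model.pt",
                         "prophetnet")
print(" ".join(cmd))
```

Building the command as a list (rather than one shell string) also sidesteps the quoting problem with paths that contain spaces, such as "My Drive" on a mounted Google Drive.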

ShoubhikBanerjee commented 4 years ago

@harshithbelagur I guess the task should be translation_prophetnet, as mentioned in the README.

And the data should be in the .src and .tgt format, as mentioned in the Data Preprocess section for other datasets.
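As I understand the expected layout, it is plain-text .src/.tgt file pairs with one example per line: the source document in test.src and the reference summary in test.tgt. A toy sketch (file contents here are made up for illustration):

```python
import os
import tempfile

# Toy sketch of the .src/.tgt layout described above: one example per line,
# source document in test.src, reference summary in test.tgt.
workdir = tempfile.mkdtemp()
with open(os.path.join(workdir, "test.src"), "w", encoding="utf-8") as f:
    f.write("the quick brown fox jumped over the lazy dog .\n")
with open(os.path.join(workdir, "test.tgt"), "w", encoding="utf-8") as f:
    f.write("a fox jumped over a dog .\n")

print(sorted(os.listdir(workdir)))
```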

Please let me know if it actually helped.

harshithbelagur commented 4 years ago

@ShoubhikBanerjee I'm actually not trying to fine-tune the model, only to use it to summarize a document I have. I've used convert_cased2uncased('1.txt', '2.txt') to convert it as shown in the Data Preprocess step. Then 2.txt is fed in as:

!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt

!fairseq-generate 2.txt --path "content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

All of this is done on Colab. I would love to know if there was a mistake at some point in what I'm doing. Thanks!

ShoubhikBanerjee commented 4 years ago

@harshithbelagur are you getting any error after applying --task translation_prophetnet ?

Moreover, could you please have a look here. It seems that fairseq-generate requires its first argument to be the processed data path (like the "cnndm/processed" mentioned there), which is a set of .bin files plus src and tgt dictionaries, not .txt files.

For inference, you have to use the directory of those "processed" files.

Hope I am right.

harshithbelagur commented 4 years ago

@ShoubhikBanerjee Here's my entire code from Colab, could you suggest the exact changes I will have to make?

!git clone https://github.com/microsoft/ProphetNet.git
!pip install torch==1.3.0
!pip install fairseq==v0.9.0

from google.colab import drive
drive.mount('/content/drive')

from pytorch_transformers import BertTokenizer
import tqdm

def convert_cased2uncased(fin, fout):
    fin = open(fin, 'r', encoding='utf-8')
    fout = open(fout, 'w', encoding='utf-8')
    tok = BertTokenizer.from_pretrained('bert-base-uncased')
    for line in tqdm.tqdm(fin.readlines()):
        org = line.strip().replace(" ##", "")
        new = tok.tokenize(org)
        new_line = " ".join(new)
        fout.write('{}\n'.format(new_line))

convert_cased2uncased('1.txt', '2.txt')
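As an aside, the function above never closes its file handles, so the output may not be flushed promptly. Below is a sketch of the same conversion using context managers; the BERT tokenizer is swapped for a trivial lowercase-and-split stand-in purely so the sketch runs without downloading a model (the real script should keep BertTokenizer):

```python
# Sketch of convert_cased2uncased with explicit file handling.
# toy_tokenize is a stand-in for BertTokenizer.tokenize, used here only
# so the example is self-contained.
import os
import tempfile

def toy_tokenize(text):
    return text.lower().split()

def convert_cased2uncased_sketch(fin_path, fout_path, tokenize=toy_tokenize):
    with open(fin_path, 'r', encoding='utf-8') as fin, \
         open(fout_path, 'w', encoding='utf-8') as fout:
        for line in fin:
            org = line.strip().replace(" ##", "")  # undo prior WordPiece marks
            fout.write(" ".join(tokenize(org)) + "\n")

# Demo run on a throwaway file.
d = tempfile.mkdtemp()
src, dst = os.path.join(d, "1.txt"), os.path.join(d, "2.txt")
with open(src, "w", encoding="utf-8") as f:
    f.write("Hello World\n")
convert_cased2uncased_sketch(src, dst)
with open(dst, encoding="utf-8") as f:
    out = f.read()
print(out)  # hello world
```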

!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt

!fairseq-generate 2.txt --path "content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE

P.S. 1.txt is where the document I need summarized lives. Thank you so much for this!

ShoubhikBanerjee commented 4 years ago

Okay, sorry for being late.

Step 1: You need to prepare your "processed" data...

Step 2. Test your own data

If it fails again, kindly check that the .bin and .idx files, plus a dict.src.txt and dict.tgt.txt, exist in that path.
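That check can be scripted. Below is a hypothetical checker along those lines; the file names follow what is described in this thread, and fairseq's exact binarized shard names may differ by version:

```python
import os
import tempfile

# Hypothetical checker: report which expected preprocessed files are missing
# from the fairseq-preprocess destdir. Names are taken from this thread.
EXPECTED = ["dict.src.txt", "dict.tgt.txt"]

def missing_processed_files(processed_dir):
    present = set(os.listdir(processed_dir))
    missing = [name for name in EXPECTED if name not in present]
    # fairseq also writes binarized shards; expect at least one .bin/.idx pair
    if not any(n.endswith(".bin") for n in present):
        missing.append("*.bin")
    if not any(n.endswith(".idx") for n in present):
        missing.append("*.idx")
    return missing

# Demo against a throwaway directory containing everything expected.
d = tempfile.mkdtemp()
for name in ["dict.src.txt", "dict.tgt.txt",
             "test.src-tgt.src.bin", "test.src-tgt.src.idx"]:
    open(os.path.join(d, name), "w").close()
print(missing_processed_files(d))  # []
```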

I am sorry, but I am also a learner; don't feel bad if I don't get your issue :)

harshithbelagur commented 4 years ago

Thanks a ton for this Shoubhik. It almost seems to work. Where do I pass the text that I need to summarize though?

Do I pass it when I run the preprocess command? If so, it asks for another input, and if I use Ctrl+C to abort, it causes a KeyboardInterrupt and creates multiple .bin files and the dict files, but no .idx files.

FileNotFoundError: Dataset not found: test (ProphetNet/src/cnndm/processed) is the error when running the generate command.

Following is the code I used -

!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

!fairseq-generate ProphetNet/src/cnndm/processed --path org_data/prophetnet_large_160G_cnndm_model.pt --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --batch-size 32 --gen-subset test --beam 4 --num-workers 4 --min-len 45 --max-len-b 110 --no-repeat-ngram-size 3 --lenpen 1.0 2>&1 > summary.txt

Thank you so much for doing this!

ShoubhikBanerjee commented 4 years ago

As in your code:

!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

The --validpref valid expects files named "valid", but your data files are named "dev". So kindly clear all the previously generated files, rename the downloaded "dev" files to "valid", and run the preprocess step (the command above) again.
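The rename can be scripted too. A sketch, assuming the downloaded files are named dev.src and dev.tgt (directory and file names are assumptions based on this thread):

```python
import os
import tempfile

# Sketch: rename dev.* to valid.* so fairseq-preprocess's "--validpref valid"
# finds them. File names are assumptions based on this thread.
def rename_dev_to_valid(data_dir):
    for ext in ("src", "tgt"):
        old = os.path.join(data_dir, "dev." + ext)
        new = os.path.join(data_dir, "valid." + ext)
        if os.path.exists(old):
            os.rename(old, new)

# Demo on a throwaway directory.
d = tempfile.mkdtemp()
for ext in ("src", "tgt"):
    open(os.path.join(d, "dev." + ext), "w").close()
rename_dev_to_valid(d)
print(sorted(os.listdir(d)))  # ['valid.src', 'valid.tgt']
```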

harshithbelagur commented 4 years ago

@ShoubhikBanerjee Could you please review the file on Colab here: https://colab.research.google.com/drive/1_0M2wevqz3pHnuoo-LS4KcTzNs4sZfFo?usp=sharing. The files are loaded.

ShoubhikBanerjee commented 4 years ago

Hi @harshithbelagur ,

I don't see the output of your last step (i.e. "!fairseq-generate..."); it just shows: 73% 263/360 [3:23:59<1:28:16, 54.61s/it, wps=47].

Did it work? Did you get 6 .bin files, 6 .idx files, and the dict.src.txt and dict.tgt.txt files in your "ProphetNet/src/cnndm/processed"?

Moreover, I can't edit it; the link gives only "view" permission.

harshithbelagur commented 4 years ago

It seems to be working fine now. Thank you so much @ShoubhikBanerjee