harshithbelagur closed this issue 4 years ago
@harshithbelagur I guess the task should be translation_prophetnet, as mentioned in the readme.
And the data should be in the .src and .tgt format mentioned in Data Preprocess for the other datasets.
Please let me know if it actually helped.
@ShoubhikBanerjee I'm not actually trying to fine-tune the model; I'm only trying to use it to summarize a document I have. I've used convert_cased2uncased('1.txt', '2.txt') to convert it as shown in the Data Preprocess step. Then 2.txt is fed in as:
```shell
!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt
!fairseq-generate 2.txt --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
```
All of this is done on Colab. I would love to know if there was a mistake at some point in what I'm doing. Thanks!
@harshithbelagur are you getting any error after applying --task translation_prophetnet?
Moreover, could you please have a look here. It seems that fairseq-generate requires the first argument to be the processed path (like "cnndm/processed"), which is a set of .bin files plus src and tgt dictionaries, not .txt files.
For inference you have to use the directory of those "processed" files.
I hope I am right.
@ShoubhikBanerjee Here's my entire code from Colab, could you suggest the exact changes I will have to make?
```shell
!git clone https://github.com/microsoft/ProphetNet.git
!pip install torch==1.3.0
!pip install fairseq==v0.9.0
```

```python
from google.colab import drive
drive.mount('/content/drive')
```

```python
from pytorch_transformers import BertTokenizer
import tqdm

def convert_cased2uncased(fin, fout):
    fin = open(fin, 'r', encoding='utf-8')
    fout = open(fout, 'w', encoding='utf-8')
    tok = BertTokenizer.from_pretrained('bert-base-uncased')
    for line in tqdm.tqdm(fin.readlines()):
        org = line.strip().replace(" ##", "")
        new = tok.tokenize(org)
        new_line = " ".join(new)
        fout.write('{}\n'.format(new_line))
```
```python
convert_cased2uncased('1.txt', '2.txt')
```

```shell
!SUFFIX=_ck7_pelt1.0_test_beam4
!BEAM=4
!LENPEN=1.0
!OUTPUT_FILE=summary.txt
!SCORE_FILE=score.txt
!fairseq-generate 2.txt --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task translation_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
```
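As a side note, the string handling in convert_cased2uncased can be sketched without downloading the tokenizer. The fake_uncased_tokenize below is a made-up stand-in for BertTokenizer.tokenize (the real tokenizer also splits rare words into "##"-prefixed subwords); the point is just the round trip:

```python
def join_wordpieces(tokenized_line):
    # WordPiece marks subword continuations with "##", so joining a
    # previously tokenized line back is a plain string replace:
    # "play ##ing foot ##ball" -> "playing football"
    return tokenized_line.strip().replace(" ##", "")

def fake_uncased_tokenize(text):
    # Toy stand-in for BertTokenizer.tokenize: lowercase + whitespace split.
    return text.lower().split()

line = "Play ##ing Foot ##ball"
detok = join_wordpieces(line)            # "Playing Football"
retok = " ".join(fake_uncased_tokenize(detok))
```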
P.S. 1.txt is the file containing the document I need summarized. Thank you so much for this!
Okay, sorry for the late reply.
Step 1: You need to prepare your "processed" data...
Download data from the provided link of UniLM.
Extract the archive and copy the files named dev.src, dev.tgt, test.src, test.tgt, train.src, train.tgt to a folder, say, unilm_processed.
Run preprocess_cnn_dm.py and save the outputs to a folder, say, PreProcessedData.
Run the command:

```shell
fairseq-preprocess \
  --user-dir prophetnet \
  --task translation_prophetnet \
  --source-lang src --target-lang tgt \
  --trainpref <path_to_PreProcessedData>/train \
  --validpref <path_to_PreProcessedData>/dev \
  --testpref <path_to_PreProcessedData>/test \
  --destdir cnndm/processed \
  --srcdict ./vocab.txt --tgtdict ./vocab.txt \
  --workers 20
```
In your <path_to_cnndm/processed> you will see that it has generated some binarized files: 6 .bin files, 6 .idx files, plus dict.src.txt and dict.tgt.txt.
Step 2: Test your own data.
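One way to run your own document through fairseq-preprocess is to write it as a one-line test.src with a dummy test.tgt (the file names and the "dummy" target here are my own illustration; the .src line should already be BERT-tokenized):

```python
import os
import tempfile

def write_test_pair(document_text, src_path, tgt_path):
    # fairseq-preprocess expects parallel .src/.tgt files, one example per
    # line; for pure inference the target side can be a placeholder.
    one_line = ' '.join(document_text.split())  # keep it on a single line
    with open(src_path, 'w', encoding='utf-8') as f:
        f.write(one_line + '\n')
    with open(tgt_path, 'w', encoding='utf-8') as f:
        f.write('dummy\n')

workdir = tempfile.mkdtemp()
write_test_pair('your bert-tokenized document goes here',
                os.path.join(workdir, 'test.src'),
                os.path.join(workdir, 'test.tgt'))
```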
If it fails again, kindly check that the .bin and .idx files and the dict.src.txt and dict.tgt.txt files exist in that path.
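That check can be scripted; the sketch below assumes fairseq-preprocess's usual {split}.src-tgt.{lang}.{bin,idx} naming, so adjust it if your version names the files differently:

```python
import os

def check_processed_dir(path):
    """Return the binarized files that are missing from a processed dir."""
    expected = ['dict.src.txt', 'dict.tgt.txt']
    for split in ('train', 'valid', 'test'):
        for lang in ('src', 'tgt'):
            for ext in ('bin', 'idx'):
                expected.append('{}.src-tgt.{}.{}'.format(split, lang, ext))
    return [name for name in expected
            if not os.path.exists(os.path.join(path, name))]

missing = check_processed_dir('cnndm/processed')
if missing:
    print('missing files:', missing)
```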
I'm sorry, but I'm also a learner; don't feel bad if I don't get your issue :)
Thanks a ton for this, Shoubhik. It almost seems to work. Where do I pass the text that I need summarized, though?
Do I pass it when I run the preprocess command? If yes, it asks for another input, and if I use Ctrl+C to abort, it raises a KeyboardInterrupt and creates multiple .bin files and the dict files, but no .idx files.
Running the generate command gives this error:

```
FileNotFoundError: Dataset not found: test (ProphetNet/src/cnndm/processed)
```

Following is the code I used:

```shell
!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20

!fairseq-generate ProphetNet/src/cnndm/processed --path org_data/prophetnet_large_160G_cnndm_model.pt --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --batch-size 32 --gen-subset test --beam 4 --num-workers 4 --min-len 45 --max-len-b 110 --no-repeat-ngram-size 3 --lenpen 1.0 2>&1 > summary.txt
```
Thank you so much for doing this!
As in your code:

```shell
!fairseq-preprocess --user-dir ProphetNet/src/prophetnet --task translation_prophetnet --source-lang src --target-lang tgt --trainpref train --validpref valid --testpref test --destdir ProphetNet/src/cnndm/processed --srcdict ProphetNet/src/vocab.txt --tgtdict ProphetNet/src/vocab.txt --workers 20
```
the --validpref is set to "valid", but in your data it's "dev". So kindly clear all the previously generated files, rename the downloaded "dev" files to "valid", and run the preprocess step (the command above) again.
@ShoubhikBanerjee Could you please review the file on Colab here - https://colab.research.google.com/drive/1_0M2wevqz3pHnuoo-LS4KcTzNs4sZfFo?usp=sharing. The files are loaded
Hi @harshithbelagur ,
I don't see the output of your last step, i.e. "!fairseq-generate..."; it just shows: 73% 263/360 [3:23:59<1:28:16, 54.61s/it, wps=47].
Did it work? Did you get 6 .bin files, 6 .idx files, and the dict.src.txt and dict.tgt.txt files in your "ProphetNet/src/cnndm/processed"?
Moreover, I can't edit the notebook; it has only "view" permission.
It seems to be working fine now. Thank you so much @ShoubhikBanerjee
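With generation working, the actual summaries can be pulled out of summary.txt. The sketch below assumes fairseq-generate's usual log-style output, where hypotheses appear on tab-separated "H-<id>" lines alongside "S-" and "T-" lines; it also undoes the " ##" WordPiece markers from the uncased preprocessing:

```python
def extract_hypotheses(generate_output_lines):
    # fairseq-generate interleaves source (S-), target (T-) and hypothesis
    # (H-) lines; keep only the hypothesis text, ordered by example id.
    hyps = {}
    for line in generate_output_lines:
        if line.startswith('H-'):
            parts = line.rstrip('\n').split('\t')
            idx = int(parts[0][2:])          # "H-7" -> 7
            text = parts[2]                   # parts[1] is the model score
            hyps[idx] = text.replace(' ##', '')  # undo WordPiece splits
    return [hyps[i] for i in sorted(hyps)]
```

Usage would be something like extract_hypotheses(open('summary.txt', encoding='utf-8')).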
I'm following these steps to summarize my document:

1) Download the CNN/DM fine-tuned checkpoint.
2) Preprocess your text with BERT tokenization; you can refer to our preprocess scripts.
3) Use fairseq-generate or fairseq-interactive to generate a summary for your given text. For fairseq-generate, you can refer to our generate scripts. With fairseq-interactive, you can easily generate a summary for typed-in text interactively. Detailed instructions can be found in the fairseq manual.
What is the --task argument for summarization?
Also, would this be sufficient if my processed input is in 2.txt?
```shell
fairseq-generate 2.txt --path "/content/drive/My Drive/prophetnet_large_160G_cnndm_model.pt" --user-dir prophetnet --task summarization_prophetnet --batch-size 80 --gen-subset test --beam $BEAM --num-workers 4 --lenpen $LENPEN 2>&1 > $OUTPUT_FILE
```