nlpyang / PreSumm

Code for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders"

How to do inference using pretrained bertsum models? #243

Open sumitmishra27598 opened 2 years ago

sumitmishra27598 commented 2 years ago

Hi folks,

I want to use these pre-trained models to summarize my own input text. Let's say I have 10 articles that I want to summarize using BertSumExt, so first I have to preprocess my raw inputs.

The README has the following steps:

Step 1. Download Stories. In my case, these will be my own articles (see my sketch below).
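For Step 1, here is what I plan to do with my custom articles. The .story layout and the dummy @highlight are my assumptions from looking at the CNN/DM story files; please correct me if the later steps need something else:

```python
# Write each custom article into RAW_PATH as a CNN/DM-style .story file.
# The dummy @highlight section is my workaround, since the later steps
# appear to expect a reference summary in each story file.
import os

articles = ["This is the article ...", "Another article ..."]  # my raw inputs
raw_path = 'RAW_PATH'  # placeholder directory name from the README
os.makedirs(raw_path, exist_ok=True)

for i, text in enumerate(articles):
    with open(os.path.join(raw_path, f'article{i}.story'), 'w') as f:
        f.write(text + '\n\n@highlight\n\ndummy summary\n')
```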

Step 2. Download Stanford CoreNLP (and export CLASSPATH to point at the CoreNLP jar, as the README says). No issues with this part.

Step 3. Sentence Splitting and Tokenization

python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH

RAW_PATH is the directory containing the raw articles; TOKENIZED_PATH is the target directory for the generated tokenized files.
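To check that Step 3 worked, I was going to peek at one tokenized file like this. I am assuming CoreNLP's standard json output layout (sentences -> tokens -> word) and the article0.story.json naming; both are guesses on my part:

```python
# Inspect one of the files CoreNLP wrote to TOKENIZED_PATH, assuming its
# standard json output: a 'sentences' list whose items hold 'tokens'.
import json

with open('TOKENIZED_PATH/article0.story.json') as f:  # assumed file name
    doc = json.load(f)

# Recover each sentence as a list of word tokens
sents = [[tok['word'] for tok in sent['tokens']] for sent in doc['sentences']]
print(sents[:2])
```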

Step 4. Format to Simpler Json Files

python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH

Here RAW_PATH is the directory containing the tokenized files, JSON_PATH is the target directory for the generated json files, and MAP_PATH is the directory containing the urls files (../urls).
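Since I don't have any url/mapping files, I wondered whether I could build the "simpler json" files for Step 5 myself and skip -map_path entirely. This is my guess at the format from reading src/prepro/data_builder.py; the file naming, the naive tokenization, and the empty tgt are all assumptions:

```python
# Build the intermediate json myself: a list of documents, each a dict with
# 'src' (list of tokenized sentences) and 'tgt' (tokenized summary sentences).
import json

articles = ["This is the article. It has a few sentences."]  # my raw inputs

docs = []
for text in articles:
    # Naive sentence split + whitespace tokenization stands in for CoreNLP here
    src = [s.strip().split() for s in text.split('.') if s.strip()]
    docs.append({'src': src, 'tgt': []})  # no reference summary at inference

# format_to_bert seems to glob for files named like *.test.*.json in raw_path
with open('JSON_PATH/custom.test.0.json', 'w') as f:
    json.dump(docs, f)
```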

Step 5. Format to PyTorch Files

python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log
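And to sanity-check Step 5's output before running the model, I would load the produced file like this (the file name and the exact keys are my assumptions):

```python
# Inspect the .pt file written by format_to_bert. I expect a list of dicts
# with keys such as 'src', 'segs', 'clss', 'src_txt' and 'tgt_txt'.
import torch

data = torch.load('BERT_DATA_PATH/custom.test.0.bert.pt')  # assumed name
print(len(data), 'examples')
print(data[0].keys())
print(data[0]['src_txt'][:2])  # first two source sentences as plain text
```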

So my question is: what is MAP_PATH (the mapping urls)? I don't have anything like this for my custom data. My task is very simple: given raw text input (i.e. the article text), get the summary.

Could I get clear, step-by-step instructions for this?

e.g. input: "This is the article ..." output: "Summary ..."