Hi folks,

I want to use these pre-trained models for summarization with my own input text. Say I have 10 articles that I want to summarize using BertSumExt; the first step is to preprocess my raw inputs.
The README has the following steps:

Step 1. Download Stories
Here, these will be my custom articles.
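Since I don't have the CNN/DM stories, my current plan is to drop each custom article into RAW_PATH as its own plain-text file. To be clear, this is an assumption on my part: the directory name, file naming, and the `.story` extension are just my guesses based on the CNN/DM download, not anything the README confirms for custom data.

```python
import os

# My assumption: one plain-text file per article in RAW_PATH,
# mirroring the .story files from the CNN/DM download.
articles = [
    "This is the first article. It has a few sentences.",
    "This is the second article. It also has a few sentences.",
]

raw_path = "raw_stories"  # hypothetical RAW_PATH
os.makedirs(raw_path, exist_ok=True)

for i, text in enumerate(articles):
    # The .story extension is a guess based on the CNN/DM files.
    filename = os.path.join(raw_path, f"article_{i}.story")
    with open(filename, "w") as f:
        f.write(text)
```

If the tokenize step expects a different layout or extension, please correct me.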
Step 2. Download Stanford CoreNLP
This part has no issue.
Step 3. Sentence Splitting and Tokenization
python preprocess.py -mode tokenize -raw_path RAW_PATH -save_path TOKENIZED_PATH
RAW_PATH is the directory containing the articles; TOKENIZED_PATH is the target directory to save the generated tokenized files.
Step 4. Format to Simpler Json Files
python preprocess.py -mode format_to_lines -raw_path RAW_PATH -save_path JSON_PATH -n_cpus 1 -use_bert_basic_tokenizer false -map_path MAP_PATH
RAW_PATH is the directory containing the tokenized files; JSON_PATH is the target directory to save the generated json files; MAP_PATH is the directory containing the urls files (../urls).
Step 5. Format to PyTorch Files
python preprocess.py -mode format_to_bert -raw_path JSON_PATH -save_path BERT_DATA_PATH -lower -n_cpus 1 -log_file ../logs/preprocess.log
So my question is: what is MAP_PATH (i.e. the mapping urls)? I don't have anything like that for my custom data.
My task is very simple: given raw article text as input, get the summary.
Could I get clear, step-by-step instructions for this?
e.g. input: This is the article ....
output: Summary...
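In other words, the interface I'm hoping to end up with looks something like the sketch below. None of these names exist in the repo as far as I know; this is purely hypothetical, and the body is a trivial stand-in (extractive summarization picks sentences from the input, so I just take the first one here).

```python
# Hypothetical interface -- summarize() is my own placeholder name,
# not a function from this repo.
def summarize(article_text: str) -> str:
    """Placeholder for: preprocess -> BertSumExt -> extractive summary."""
    # Stand-in logic: return the first sentence, since an extractive
    # model would select sentences from the input text.
    return article_text.split(". ")[0] + "."

print(summarize("This is the article. It has more sentences. The end."))
# -> This is the article.
```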