nlpyang / PreSumm

Code for the EMNLP 2019 paper "Text Summarization with Pretrained Encoders"
MIT License
1.29k stars, 465 forks

Why is `summary_size=3` inside `greedy_selection` when creating BERT data? #158

Open seanswyi opened 4 years ago

seanswyi commented 4 years ago

The title is basically the question. To elaborate: I'm going through the code step by step so that I can create the BERT-style data used by this model for other summarization datasets as well.

I noticed that inside `data_builder._format_to_bert`, the value passed to the `summary_size` argument of `greedy_selection` is 3.

Why is this hard-coded? If my understanding is correct, `summary_size` refers to how many reference sentences there are for each src/tgt pair. There are many samples where `summary_size != 3`.
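For readers following along, here is a minimal sketch of how a greedy oracle selector of this kind typically uses `summary_size`. It is not the repository's code: `unigram_f1` and `greedy_oracle` are hypothetical names, and plain unigram-overlap F1 stands in for the ROUGE scoring. Under that assumption, `summary_size` acts as an upper bound on how many source sentences get selected as labels, rather than as a count of reference sentences.

```python
from collections import Counter


def unigram_f1(candidate_tokens, reference_tokens):
    """Unigram-overlap F1: a simplified stand-in for a ROUGE score (assumption)."""
    if not candidate_tokens or not reference_tokens:
        return 0.0
    # Multiset intersection counts each shared token at most as often as it
    # appears in both sequences.
    overlap = sum((Counter(candidate_tokens) & Counter(reference_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(candidate_tokens)
    recall = overlap / len(reference_tokens)
    return 2 * precision * recall / (precision + recall)


def greedy_oracle(doc_sents, abstract_sents, summary_size):
    """Greedily pick at most `summary_size` source sentence indices.

    Hypothetical helper: each round adds the sentence that most improves the
    overlap score against the whole abstract, and stops early when no
    remaining sentence improves it.
    """
    reference = [tok for sent in abstract_sents for tok in sent]
    selected = []
    best_score = 0.0
    for _ in range(summary_size):
        best_idx = -1
        round_best = best_score
        for i in range(len(doc_sents)):
            if i in selected:
                continue
            candidate = [tok for j in selected + [i] for tok in doc_sents[j]]
            score = unigram_f1(candidate, reference)
            if score > round_best:
                round_best, best_idx = score, i
        if best_idx == -1:  # nothing improves the score any further
            break
        selected.append(best_idx)
        best_score = round_best
    return sorted(selected)


# Toy data (made up for illustration): tokenized source and abstract sentences.
doc = [
    ["the", "cat", "sat"],
    ["dogs", "bark", "loudly"],
    ["the", "mat", "was", "red"],
    ["rain", "fell"],
]
abstract = [["the", "cat", "sat"], ["the", "mat", "was", "red"]]

print(greedy_oracle(doc, abstract, summary_size=1))  # [2]
print(greedy_oracle(doc, abstract, summary_size=3))  # [0, 2]
```

In this sketch, a cap of 3 still returns only two indices because a third sentence would lower the score, so the value behaves as a maximum and the selection can come out shorter.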

AyeshaSarwar commented 4 years ago

Even if I set `summary_size` to a value other than 3, it still produces a candidate summary with only 3 sentences.

AyeshaSarwar commented 4 years ago

This could also be the reason why the ROUGE scores are low on my other dataset.
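One possible explanation for the fixed three-sentence output, offered as an assumption rather than a confirmed reading of the repository: the candidate summary length may be decided at inference time by a separate cap on how many top-scoring sentences are kept, independent of the `summary_size` used when building the data. A minimal sketch of such a cap, where `select_top_sentences` and `max_sentences` are hypothetical names:

```python
def select_top_sentences(sentences, scores, max_sentences=3):
    """Keep the highest-scoring sentences, capped at `max_sentences`.

    Hypothetical helper illustrating an inference-time length cap; the
    sentences are returned in descending score order.
    """
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    return [sentences[i] for i in ranked[:max_sentences]]


# Made-up sentences and model scores, for illustration only.
sentences = ["a", "b", "c", "d", "e"]
scores = [0.1, 0.9, 0.4, 0.8, 0.3]

print(select_top_sentences(sentences, scores))                    # ['b', 'd', 'c']
print(select_top_sentences(sentences, scores, max_sentences=2))   # ['b', 'd']
```

If the pipeline works this way, changing `summary_size` during preprocessing would alter the training labels but not the number of sentences in the generated candidate, which would need its own setting.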