nlpyang / PreSumm

code for EMNLP 2019 paper Text Summarization with Pretrained Encoders
MIT License
1.29k stars 465 forks

Preprocess another dataset #186

Closed matt9704 closed 4 years ago

matt9704 commented 4 years ago

Hi! I want to apply your model to another dataset. In this dataset, each article has a one-sentence summary, which differs from the CNN/DailyMail dataset (where each article has a three-sentence summary). Can I follow the preprocessing steps for CNN/DailyMail to preprocess my dataset?

matt9704 commented 4 years ago

> They are selecting only one of those summaries anyway, using a greedy algorithm during preprocessing. So if you only have one summary, it will just select that one.

Umm. I think they combine the three sentences into one summary for an article, rather than choosing just one of those sentences as the summary. For example, here is an article from the CNN/DailyMail dataset: there are three highlights below the article text, and their concatenation is the summary. [image]

SebastianVeile commented 4 years ago

Ah, you are correct. That is my bad. However, I believe you have the answer to your question then. Only sentences after "@highlight" will be considered for the tgt, and if you have only one "@highlight" it will use only that sentence. So it should be possible to preprocess your data so that each article contains a single highlight.
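To make that concrete, here is a minimal sketch of the CNN/DailyMail-style story layout, assuming plain-text files where each summary sentence follows an "@highlight" marker line. The helper names `to_story` and `split_story` are hypothetical and are not part of PreSumm; the splitter only illustrates the "everything after @highlight goes to tgt" behaviour described above, not the repository's actual preprocessing code.

```python
def to_story(article_sentences, summary_sentences):
    """Render one document in a CNN/DailyMail-style story layout (hypothetical helper)."""
    parts = list(article_sentences)
    for sent in summary_sentences:
        parts.append("@highlight")  # marker line that precedes each summary sentence
        parts.append(sent)
    # Blocks are separated by blank lines.
    return "\n\n".join(parts) + "\n"


def split_story(text):
    """Split story text into (src sentences, tgt sentences) (hypothetical helper)."""
    src, tgt = [], []
    seen_highlight = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip the blank separator lines
        if line == "@highlight":
            seen_highlight = True  # everything from here on belongs to the target
            continue
        (tgt if seen_highlight else src).append(line)
    return src, tgt


# A document with a single one-sentence summary.
story = to_story(
    ["First sentence of the article.", "Second sentence of the article."],
    ["The one-sentence summary."],
)
src, tgt = split_story(story)
print(tgt)  # ['The one-sentence summary.']
```

With three summary sentences, `tgt` would hold all three, and joining them gives the concatenated summary discussed above. With one, it holds just that sentence.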

matt9704 commented 4 years ago

Maybe it'll work. I'll have a try. Thanks!