mahnazkoupaee / WikiHow-Dataset

A Large Scale Text Summarization Dataset

Lead-3 baseline - how are paragraphs defined #6

Closed: lambdaofgod closed this issue 5 years ago

lambdaofgod commented 5 years ago

"We created the Lead-3 baseline by extracting the first sentence of each paragraph and concatenated them to create the summary"

How do you precisely define paragraphs?

I tried splitting the input text on newlines, but it doesn't seem to correspond exactly to paragraph structure - there are more newlines than paragraph splits.

mahnazkoupaee commented 5 years ago

A paragraph in the original WikiHow articles is a piece of text that starts with a bold line. Each paragraph has multiple lines.

In the wikihowAll.csv file, each article is the concatenation of all its paragraphs and the reference summary is the concatenation of the bold lines. Newlines in that file do not necessarily mark paragraph boundaries. The other version of the data, wikihowSep.csv, stores each paragraph separately, so you can use it to calculate the Lead-3 baseline. Article names are unique, so you can find the paragraphs of an article by grouping on the “title” value, which all paragraphs of the same article share. A sentence tokenizer can then be applied to extract the first sentence of each paragraph.
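The steps described above can be sketched roughly as follows. This is only an illustration, not the authors' script: the "title" column comes from the comment above, while the "text" column name, the helper names `first_sentence` and `lead_baseline`, the sample rows, and the naive regex sentence splitter are all assumptions made for the example.

```python
import csv
import io
import re
from collections import defaultdict

# Naive sentence boundary (an assumption): '.', '!' or '?' followed by whitespace.
_SENT_END = re.compile(r"(?<=[.!?])\s+")


def first_sentence(paragraph):
    """Return the first sentence of a paragraph using the naive splitter."""
    stripped = paragraph.strip()
    if not stripped:
        return ""
    return _SENT_END.split(stripped, maxsplit=1)[0]


def lead_baseline(rows, title_col="title", text_col="text"):
    """Group paragraph rows by article title and join their first sentences.

    `text_col` is a hypothetical column name; adjust it to the actual file.
    """
    summaries = defaultdict(list)
    for row in rows:
        sentence = first_sentence(row.get(text_col) or "")
        if sentence:
            summaries[row[title_col]].append(sentence)
    return {title: " ".join(parts) for title, parts in summaries.items()}


# Made-up rows standing in for wikihowSep.csv (one paragraph per row).
sample = io.StringIO(
    "title,text\n"
    "How to Brew Tea,Boil fresh water. Use a kettle if you have one.\n"
    "How to Brew Tea,Steep the leaves for three minutes. Taste as you go.\n"
    "How to Nap,Find a quiet spot. Dim the lights.\n"
)
baselines = lead_baseline(csv.DictReader(sample))
```

For the real data you would pass `csv.DictReader(open("wikihowSep.csv", newline="", encoding="utf-8"))` instead of the sample, and probably swap the regex for a proper sentence tokenizer such as NLTK's `sent_tokenize`, since the regex will mis-split on abbreviations and decimals.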